AD-A277  313 


Approved for public release:
distribution unlimited.


2a. SECURITY CLASSIFICATION AU


2b. DECLASSIFICATION/DOW


4. PERFORMING ORGANIZAT


6a. NAME OF PERFORMING ORGANIZATION

Carnegie  Mellon  University 


6c. ADDRESS (City, State, and ZIP Code)

5000  Forbes  Avenue 
Pittsburgh,  PA  15213 


8a. NAME OF FUNDING/SPONSORING ORGANIZATION

AFOSR 


form 

OMB  No  0704-0199 


3. DISTRIBUTION/AVAILABILITY OF REPORT

Approved for public release:
distribution unlimited.


5. MONITORING ORGANIZATION REPORT NUMBER(S)

AFOSR-TR-


7a.  NAME  OF  MONITORING  ORGANIZATION 

Air  Force  Office  of  Scientific  Research 


7b. ADDRESS (City, State, and ZIP Code)

Program  Manager,  Neural  Networks 
Bolling  AFB,  D.C.  20332-6448 


8b. OFFICE SYMBOL  9. PROCUREMENT INSTRUMENT IDENTIFICATION NUMBER
(If applicable)

NE  AFOSR-89-0551 


8c. ADDRESS (City, State, and ZIP Code)

Building  410 

Bolling  AFB,  D.C.  20332-6448 


10. SOURCE OF FUNDING NUMBERS


PROGRAM 
ELEMENT  NO. 


S3 


WORK  UNIT 
ACCESSION  NO. 


11. TITLE (Include Security Classification)

A  Differential  Theory  of  Learning  for  Efficient  Statistical  Pattern  Recognition 


12. PERSONAL AUTHOR(S)


John Hampshire and B.V.K. Vijaya Kumar


13a.  TYPE  OF  REPORT 

Final


16.  SUPPLEMENTARY  NOTATION 

None 


14. DATE OF REPORT (Year, Month, Day)  15. PAGE COUNT

12/15/93  455


17. COSATI CODES  18. SUBJECT TERMS (Continue on reverse if necessary and identify by block number)

Learning, Pattern Recognition, Classification,
Neural Networks


19. ABSTRACT (Continue on reverse if necessary and identify by block number)

Probabilistic  learning  strategies  currently  used  are  inefficient,  requiring  high 
classifier  complexity  and  large  training  samples.  In  this  report,  we  introduce  and 
analyze  an  asymptotically  efficient  differential  learning  strategy.  It  guarantees 
the  best  generalization  allowed  by  the  chosen  classifier  paradigm.  Differential 
learning also requires only a classifier of minimal complexity. The theory is
demonstrated  in  several  real-world  machine  learning/pattern  recognition  tasks. 




20. DISTRIBUTION/AVAILABILITY OF ABSTRACT  21. ABSTRACT SECURITY CLASSIFICATION

□ UNCLASSIFIED/UNLIMITED  □ SAME AS RPT.  □ DTIC USERS  Unclassified


22a.


DD Form 1473, JUN 86


Previous editions are obsolete.


Unclassified

94 3 23 008


THIS DOCUMENT IS BEST
QUALITY AVAILABLE. THE COPY
FURNISHED TO DTIC CONTAINED
A SIGNIFICANT NUMBER OF
COLOR PAGES WHICH DO NOT
REPRODUCE LEGIBLY ON BLACK
AND WHITE MICROFICHE.




AFOSR-TR-  0073


Contents 


Overview 

1.1 Abstract

1.2 Intended Audience

1.3 Outline of the Text

1.3.1 Summary of Findings

1.3.2 Profile of the Chapters




I Theory 11

2 Probabilistic and Differential Strategies for Learning the Bayesian Discriminant Function 13

2.1 Introduction . 13

2.2 Bayesian Discrimination . 14

2.2.1 The Classifier and the Bayesian Discriminant Function . 16

2.2.2 Probabilistic and Differential forms of the Bayesian Discriminant Function . 18

2.2.3 Learning Paradigms for the Bayesian Discriminant Function . 23

2.2.4 The Link Between Objective Function and Learning Strategy . 28

2.3 Probabilistic Learning $\Lambda_P$ . 31

2.3.1 The General Error Measure . 32

2.3.2 Specific Strictly Probabilistic Error Measures . 36

2.3.3 Minkowski-r Power Metrics and Other Common Error Measures . 41

2.4 Differential Learning $\Lambda_\Delta$ . 43

2.4.1 Further Constraints Imposed on 0 by the Discriminator . 50

2.5 Summary . 51

3 Differential Learning is Asymptotically Efficient 53

3.1 Introduction . 53

3.2 Discriminant Error, the Efficient Classifier, and the Efficient Learning Strategy . 54

3.2.1 Learning and Expectation . 55

3.2.2 Discriminant Error and the Efficient Classifier . 58

3.2.3 Efficient Learning . 62




3.3 Differential Learning is Asymptotically Efficient . 66

3.3.1 Differential Learning Generates Consistent Classifiers . 69

3.3.2 A Word Regarding “Agnostic” Learning . 70

3.4 Discriminant Error Versus Functional Error, and the Inefficiency of Probabilistic Learning . 71

3.5 Differential Learning Requires the Minimum-Complexity Classifier . 72

3.6 The Case for Probabilistic Learning . 76

3.6.1 Assessing the Asymptotic Relative Efficiency (ARE) of a non-Differential Learning Strategy . 79

3.6.2 A Word Regarding “Proper Models” . 81

3.7 Summary . 81

4  The  Robust  Beauty  of  Differentially-Generated  Improper  Parametric  Models  83 

4.1  Introduction .  83 

4.2  Analysis  of  a  Proper  Parametric  Model .  85 

4.2.1 The Proper Parametric Model . 85

4.2.2 Probabilistic Learning for the Asymptotically Large Training Sample . 89

4.2.3 Differential Learning via CFM for the Asymptotically Large Training Sample . 90

4.2.4 Results of Differential and Probabilistic Learning for Asymptotically Large and Small Training Samples . 92

4.3 Analysis of an Improper Parametric Model . 99

4.3.1 The Improper Parametric Model . 100

4.3.2 Probabilistic Learning via MSE for the Asymptotically Large Training Sample . 102

4.3.3 Differential Learning via CFM for the Asymptotically Large Training Sample . 104

4.3.4 Results of Differential and Probabilistic Learning for Asymptotically Large and Small Training Samples . 105

4.4 Summary . 111

5  Properties  of  the  CFM  Objective  Function  113 

5.1  Introduction .  113 

5.2  Discriminator  Output  Space .  114 

5.2.1 The Discriminant Differential $\delta_\tau$, the Reduced Discriminant Continuum, and the Reduced Discriminant Boundary . 120

5.3  Objective  Function  Monotonicity  and  Learning  Efficiency .  122 

5.3.1  MAE  is  Non-Monotonic .  127 

5.3.2  MSE  is  Non-Monotonic .  132 

5.3.3  The  Kullback-Leibler  Information  Distance  is  Non-Monotonic .  135 

5.3.4  The  General  Error  Measure  is  Non-Monotonic .  140 

5.3.5  The  Link  Between  Objective  Function  Monotonicity  and  Learning  Efficiency  .  ...  141 




5.3.6  CFM  is  Monotonic .  142 

5.4  Training  Example  Types .  148 

5.5  The  Convergence  Properties  of  Differential  Learning  via  CFM .  150 

5.5.1  Differential  Learning  via  the  Synthetic  Form  of  CFM  is  Reasonably  Fast .  153 

5.5.2 Differential Learning via the Original Forms of CFM is Unreasonably Slow and/or Inefficient . 155

5.6  Summary .  156 

6  An  Information-Theoretic  View  of  Stochastic  Concept  Learning  159 

6.1  Introduction .  159 

6.2  Probabilistic  versus  Differential  Complexity .  161 

6.3 Exploring the Curious Relationship Between Winning a Rigged Game of Dice and Building an Efficient Classifier . 163

6.3.1 The Differential Mechanism by which the Most Likely Die Face Becomes Empirically Evident . 164

6.3.2 The Probabilistic Mechanism by which the Most Likely Die Face Becomes Empirically Evident . 171

6.3.3  Discriminant  Information  versus  Probabilistic  Information  .  175 

6.4 Bounds on the Training Sample Size Requirements of the Differential and Probabilistic Learning Strategies . 176

6.4.1 A Greatest Lower Bound on $n_\Delta$ . 177

6.4.2 A Greatest Lower Bound on $n_P$ . 178

6.5 Extending the Rigged Die Paradigm to the General C-class Learning/Pattern Recognition Task . 179

6.6  Summary .  181 

II Applications 183

7  Implementing  Differential  Learning  185 

7.1 Introduction . 185

7.2 Three Hypothesis Classes . 186

7.2.1 The Linear Hypothesis Class . 186

7.2.2 The Logistic Linear Hypothesis Class . 186

7.2.3 The Gaussian Radial Basis Hypothesis Class . 187

7.3 Learning to Identify the Irises of the Gaspe Peninsula . 187

7.4 Controlling the Confidence Parameter . 191

7.5 Focussing on the Un-Learned Examples . 193

7.6 Rejecting the Classification . 202

7.7 The Importance of Representational Choices . 207

7.8 Minimizing the Classifier’s Complexity . 215

7.9 Summary . 217




8 Optical Character Recognition with Differential Learning 219

8.1 Introduction . 219

8.1.1 A Word Regarding Training and Test Samples . 220

8.2 Test and Evaluation Protocols . 221

8.2.1 Estimating Error Rates . 221

8.2.2 Estimating a Classifier’s MSDE . 222

8.2.3 Graphical Statistical Summaries . 224

8.3 Compressing the Data to Improve Generalization . 225

8.4 Recognition Results . 230

8.4.1 Experiments with the Logistic Linear Hypothesis Class . 232

8.4.2 Experiments with Alternative Hypothesis Classes . 239

8.4.3 Interpretation of Results . 243

8.4.4 Rejecting Classifications After Learning . 246

8.5 Recognition Results in the Presence of Noise . 248

8.5.1 Signal-to-Noise Ratio (SNR) Computations . 248

8.5.2 The Compressed Noisy Feature Vector is Approximately Homoscedastic Gaussian . 250

8.5.3 Recognition Results for a Moderate SNR . 253

8.5.4 Recognition Results for a Low SNR . 261

8.6 Summary . 266


9  Medical  Diagnosis  with  Differential  Learning  271 

9.1  Introduction .  271 

9.2  Recognition  Results . 277 

9.3  Summary . 284 


10  Remote  Sensing  with  Differential  Learning  285 

10.1 Introduction . 285

10.2  Training  Data  . 287 

10.3  Experimental  Results . 289 

10.3.1  Interpretation  of  Test  Results  . 298 

10.4  Summary .  301 


11 Conclusions 303

11.1 Scientific Contributions . 303

11.2 Philosophical Implications of Differential Learning . 304

11.3 Future Research . 307

A  Glossary  of  Notation  311 

B  Notes  on  Convergence  319 

C  The  Box  Plot  Statistical  Summary  321 

C.1 How to Read a Box Plot . 321

C.2 How to Construct a Box Plot . 323

D A Synthetic Functional Form of the Classification Figure of Merit 327

D.1 Specifications for the Synthetic CFM Objective Function . 328

D.2 The Computational Cost of the Synthetic CFM Objective Function . 334

D.3 The Convergence Properties of Differential Learning via the CFM Objective Function . 334

D.3.1 The Convergence Properties of Differential Learning via the Original Logistic Sigmoidal Form of CFM . 336

D.3.2 The Convergence Properties of Differential Learning via the Synthetic Form of CFM . 340

D.4 A Proof Relating to Synthetic CFM and Chapter 2 . 346

D.5 Modifying Backpropagation for use with CFM . 348

D.6 Source Code for the Synthetic CFM Objective Function . 351

E Differential Learning via CFM Viewed as a Generalization of Learning via Rosenblatt’s Perceptron Criterion Function 359

F Proper Parametric Models of the Homoscedastic Gaussian Feature Vector 363

F.1 The Fully-Parametric Proper Model . 364

F.2 The Partially-Parametric Proper Model . 365

F.2.1 C = 2: Logistic Regression . 365

F.2.2 C > 2: Logistic Discriminant Analysis . 367

F.3 The Asymptotic Relative Efficiency of Logistic Discriminant Analysis Versus Normal-Based Linear Discriminant Analysis . 368

F.4 The Proper Parametric Model Constraints are Severe . 369

G Error Rate Computations for the Classifiers of Chapter 4 371

G.1 Error Rate Computations for the Proper Parametric Model . 371

G.2 Error Rate Computations for the Improper Parametric Model . 372


H Asymptotic Parameterizations for the Probabilistically-Generated Improper Parametric Models of Chapter 4 375

H.1 Distribution-Independent Expressions for the Parameterization of Low-Order Polynomial Discriminant Functions . 377

H.2 Distribution-Dependent Expressions for the Parameterization of Polynomial Discriminant Functions . 379

I Monotonic Fractions Generated by Three Error Measures 383

I.1 MAE Monotonic Fractions . 383

I.2 MSE Monotonic Fractions . 385

I.3 Kullback-Leibler Monotonic Fractions . 387

J  Tabulated  Die  Casting  Bounds  391 

K  A  Modified  Radial  Basis  Function  Classifier  405 

L  Anderson  &  Fisher’s  Iris  Data  407 

L.1 Original Iris Data . 407

L.2 Normalized Iris Data . 409

M Complexity Reduction Techniques 413

M.1 Weight Decay . 413

M.1.1 Parametric Entropy . 415

M.2 Weight Smoothing . 416

M.3 Linear Non-Invertible Feature Vector Compression . 417

M.3.1 A Brief Argument Against Principal Components Analysis . 419


Index 431


List  of  Tables 


2.1 Some well-known classification paradigms and the learning strategies they employ . 25

4.1 The minimum-MSE parameterizations for the minimum-, low-, and high-complexity
polynomial classifiers of x when the training sample size n is asymptotically large
(i.e., $n \to \infty$).


8.1 Estimated MSDE for the high and low-complexity logistic linear classifiers employing
differential and probabilistic learning . 232

8.2 Estimated discriminant bias, discriminant variance, and MSDE for 650-parameter classifiers
generated from the linear, logistic linear, and modified RBF hypothesis classes by differential
and probabilistic learning . 245

8.3 Empirical benchmark test sample error rates for 650-parameter classifiers produced from the
linear, logistic linear, and modified RBF hypothesis classes by differential and probabilistic
learning . 247

8.4 Benchmark test sample rejection/empirical error rate statistics for 650-parameter classifiers
produced from the linear, logistic linear, and modified RBF hypothesis classes by differential
and probabilistic learning . 247

8.5 Estimated relative efficiency of differential versus probabilistic learning for the linear, logistic
linear, and modified Gaussian RBF hypothesis classes . 258

9.1 Estimated discriminant bias, discriminant variance, and MSDE for the 257-parameter logistic
linear classifiers generated by differential and probabilistic learning. Estimates are also shown
for Manduca, Christy, and Ehman’s 2058-parameter logistic linear and 6164-parameter
multi-layer perceptron (MLP) classifiers, both generated probabilistically with the MSE objective
function [90]. Finally, estimates are shown for 10 human subjects [90]: each performed one
2-fold cross validation trial.


10.1  The  training  sample  sizes  for  both  the  maximum-likelihood  and  DRBF  classifiers . 288 




10.2 Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made
by the DRBF classifier over the civill site . 291

10.3 Confusion matrix for the DRBF classifier over the civill site . 291

10.4 Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made
by the maximum-likelihood (ML) classifier over the civill site . 292

10.5 Confusion matrix for the maximum-likelihood (ML) classifier over the civill site . 292

10.6 Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made
by the DRBF classifier over the gaol site . 295

10.7 Confusion matrix for the DRBF classifier over the gaol site . 295

10.8 Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made
by the maximum-likelihood (ML) classifier over the gaol site . 296

10.9 Confusion matrix for the maximum-likelihood (ML) classifier over the gaol site . 296

10.10 A summary of the empirical test sample error rates for both the maximum-likelihood and
DRBF classifiers . 298

C.1 A listing of indices used to compute box plot 5-number summaries for various sample sizes . 324




List  of  Figures 

2.1 Left: The products of the class-conditional densities and class prior probabilities
for a three-class random scalar x. Right: The associated a posteriori probabilities
$P_{W|x}(\omega_i \mid x)$ of the three classes.

2.2 A diagrammatic view of the classifier and its associated functional mappings.

2.3 The a posteriori class differentials $\Delta_{W|x}(\omega_i \mid x)$ of the three-class random
variable x depicted in figure 2.1.

2.4 Left: The a posteriori class probabilities $P_{W|x}(\omega_i \mid x)$ of the three-class random
variable depicted in figures 2.1 and 2.3, with a differential form of the Bayesian discriminant
function superimposed. Right: The a posteriori class differentials
$\Delta_{W|x}(\omega_i \mid x)$ of the same three-class random variable, with the discriminant
differentials of the differential form of the Bayesian discriminant function superimposed.

2.5  A  diagrammatic  comparison  of  error  measure  (EM)  and  classification  figure-of-merit  (CFM) 

objective  functions . 

2.6  A  synthetic  asymmetric  sigmoidal  form  of  the  classification  figure-of-merit  (CFM) . 

2.7 The discriminator output’s minimum-error value for the Minkowski-r power metric
(r = 1.25, 2, 9; binary output target values).

3.1 The discriminant bias, discriminant variance, and mean-squared discriminant error (MSDE)
of three different classifier paradigms.

4.1  A  two-class  scalar  feature  discrimination  task.  The  single  feature  is  a  homoscedastic, 

Gaussian-distributed  random  variable . 

4.2  The  proper  parametric  model  of  x.  The  logistic  linear  hypothesis  class  follows  from  both  the 

partially-parametric  and  fully-parametric  proper  models  of  x . 

4.3  A  comparison  of  error  rates  for  differentially  (CFM)  and  probabilistically  (MSE,  CE,  and 

ML)  generated  logistic  linear  classifiers . 




4.4 The empirical class-conditional pdfs of x multiplied by their empirical class prior
probabilities for a training sample size of 10 examples . 95

4.5 The empirical a posteriori class probabilities of x, shown in histogram form for the 10
examples of figure 4.4. The discriminant functions of the CE-generated logistic linear
classifier (i.e., the partially-parametric model of x) are superimposed in black . 96

4.6  The  same  empirical  a  posteriori  class  probabilities  shown  in  figure  4.5.  The  discriminant 

functions  of  the  CFM-generated  logistic  linear  classifier  are  superimposed  in  black .  96 

4.7  A  comparison  of  the  approximated  mean-squared  discriminant  error  (~MSDE)  for  the 

differentially  (CFM)  and  probabilistically  (MSE,  CE,  and  ML)  generated  classifiers .  98 

4.8  A  comparison  of  the  approximated  discriminant  bias  (~DBias)  for  the  differentially  (CFM) 

and  probabilistically  (MSE)  generated  logistic  linear  classifiers .  98 

4.9  A  three-class  scalar  feature  discrimination  task.  The  single  feature  is  a  heteroscedastic, 

uniformly-distributed  random  variable .  101 

4.10 The polynomial classifier of x depicted as a neural network paradigm . 102

4.11 Discriminant functions of probabilistically (MSE) and differentially (CFM) generated
polynomial classifiers of x for an asymptotically large training sample size . 107

4.12 Discriminant functions of probabilistically (MSE) and differentially (CFM) generated
polynomial classifiers of x for a training sample of size n = 100 . 108

4.13  A  comparison  of  error  rates  for  differentially  (CFM)  and  probabilistically  (MSE)  generated 

polynomial  classifiers .  110 

4.14 A comparison of the approximated mean-squared discriminant error (~MSDE) for
differentially (CFM) and probabilistically (MSE) generated polynomial classifiers . 110

5.1 Reduced discriminator output space for a hypothetical classifier with C outputs that take on
values between zero and one . 119

5.2  An  illustration  of  the  discriminant  differential  and  its  relationship  to  reduced  discriminator 

output  space .  121 

5.3  The  MSE  objective  function  is  non-monotonic .  127 

5.4 The MAE objective function can be monotonic if and only if the number of classes is C = 2 . 130

5.5  The  MAE  objective  function  is  increasingly  non-monotonic  as  C  increases .  131 

5.6  The  MSE  objective  function  is  increasingly  non-monotonic  as  C  increases . 135 

5.7  The  Kullback-Leibler  information  distance  (CE  objective  function)  is  non-monotonic.  ...  139 

5.8  The  CE  objective  function  is  increasingly  non-monotonic  as  C  increases . 140 

5.9 The simple two-class scalar feature discrimination task for which CFM is monotonic if and
only if $\psi$ ~ .05 . 145




5.10  The  CFM  generated  by  an  asymptotically  large  training  sample  of  the  two-class  random 

feature  x  described  by  (5.76)  —  (5.78) .  146 

5.11 Details of the maximum CFM generated by an asymptotically large training sample of the
two-class random feature x . 146

5.12 Three types of training examples: un-learned examples, transition examples, and learned
examples . 149

5.13 The synthetic CFM objective function, given a confidence parameter of $\psi$ = .05 . 153

5.14 Old forms of the CFM objective function, described in [55] . 156

7.1 Two of the four features (petal length and width) E. Anderson measured on the Irises of the
Gaspe Peninsula [3] . 189

7.2  The  confusable  examples  of  figure  7.1,  plotted  as  a  function  of  the  other  two  features  (sepal 

length  and  width) .  190 

7.3 The 15-parameter differential logistic linear classifier’s output state, as projected onto
the reduced discriminant continuum, after attempting to learn all the Iris data with high
confidence . 194

7.4 The differentially-generated logistic linear classifier’s output state after attempting to learn
the Iris data with moderate confidence . 195

7.5 The differentially-generated logistic linear classifier’s output state after attempting to learn
the Iris data with low confidence . 196

7.6 Left: Histograms of the output states for the classifier in figure 7.4 after 350 learning epochs:
$\psi$ = 0.6. Right: Histograms of the output states for the same classifier (figure 7.5) after
350 learning epochs: $\psi$ is reduced from 0.6 to 0.1 over the first 200 learning epochs . 197

7.7 The empirical error rates (training sample with all 150 examples) for the 15-parameter
logistic linear classifier shown in figure 7.5 as it learns differentially (CFM). The classifier’s
empirical error rate after 350 learning epochs is 1.3 (+2.5/-1.3)% . 198

7.8 The empirical error rates (training sample with all 150 examples) for the 15-parameter logistic
linear classifier as it learns probabilistically (MSE). The classifier’s empirical error rate after
350 learning epochs is 2.0 (+2.9/-2.0)% . 199

7.9 The 15-parameter logistic linear classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (MSE;
see figure 7.8) . 200

7.10 The empirical error rates (training sample with all 150 examples) for the 15-parameter logistic
linear classifier shown in figure 7.3 as it learns probabilistically (Kullback-Leibler, CE).
The classifier’s empirical error rate after 350 learning epochs is 2.0 (+2.9/-2.0)% . 201




7.11 The 15-parameter logistic linear classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (CE; see
figure 7.10) . 202

7.12 Figure 7.4 shown with a rejection threshold of $\delta_{reject}$ = 0.35 . 203

7.13 The differentially-generated logistic linear classifier’s output state after attempting to learn
the Iris data with lower confidence (0.26), shown with a rejection threshold of $\delta_{reject}$ = 0.15 . 204

7.14 Figures 7.9 (MSE, top) and 7.11 (CE, bottom) shown with a rejection threshold of
$\delta_{reject}$ = 0.35 . 205

7.15 The empirical error rates (training sample with all 150 examples) for the 15-parameter linear
classifier as it learns differentially (CFM). The classifier’s empirical error rate after 350
learning epochs is 1.3 (+2.5/-1.3)% . 207

7.16 The 15-parameter differentially-generated linear classifier’s output state, as projected
onto the reduced discriminant continuum, after attempting to learn all the Iris with low
confidence. The classifier cannot learn examples 83 and 133 (see figure 7.2) . 208

7.17 The empirical error rates (training sample with all 150 examples) for the 15-parameter linear
classifier as it learns probabilistically (MSE). The classifier’s empirical error rate after 350
learning epochs is 14.7 (+6.2/-5.8)% . 209

7.18 The 15-parameter linear classifier’s output state, as projected onto the reduced discriminant
continuum, after attempting to learn all the Iris probabilistically (MSE). The classifier
cannot learn 22 of the examples . 210

7.19 The empirical error rates (training sample with all 150 examples) for the 15-parameter
modified RBF classifier as it learns differentially (CFM). The classifier’s empirical error rate
after 350 learning epochs is 2.0 (+2.9/-2.0)% . 211

7.20 The 15-parameter differential modified RBF classifier’s output state, as projected onto the
reduced discriminant continuum, after attempting to learn all the Iris with low confidence.
The classifier cannot learn examples 70, 83, and 133 (see figure 7.2) . 212

7.21 The empirical error rates (training sample with all 150 examples) for the 15-parameter
modified RBF classifier as it learns probabilistically (MSE). The classifier’s empirical error
rate after 350 learning epochs is 4.7 (+4.0/-3.4)% . 213

7.22 The 15-parameter modified RBF classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (MSE). The
classifier cannot learn 7 of the examples . 214

7.23 The empirical error rates (training sample with all 150 examples) for the 15-parameter
modified RBF classifier as it learns probabilistically (Kullback-Leibler, CE). The classifier’s
empirical error rate after 350 learning epochs is 6.7 (+4.6/-4.0)% . 215


7.24  The  15-parameter  modified  RBF  classifier’s  output  state  — as  projected  onto  the  reduced 
discriminant  continuum  — after  attempting  to  learn  all  the  Iris  probabilistically  (Kullback- 
Leibler  —  CE).  The  classifier  cannot  learn  10  of  the  examples . 216 

8.1 Forty digits randomly chosen from the AT&T DB1 database . 220

8.2 Parameters or weights of the logistic linear classifier after learning the DB1 database’s
benchmark training sample differentially . 225

8.3 Left: Test sample classification summaries for the 2570-parameter logistic linear classifier
employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). Right:
The difference between the probabilistically-generated models’ empirical error rates and the
differentially-generated model’s rate on a trial-by-trial basis . 226

8.4  The  distribution  of  parameter  values  in  the  257-parameter  logistic  linear  discriminant  function 

representing  the  digit  “3”  (cf.  figure  8.2) .  227 

8.5 The same digits shown in figure 8.1, linearly compressed from 256- to 64-pixel images . 228

8.6 Parameters or weights of the 650-parameter logistic linear classifier after learning the DB1
database’s benchmark training sample differentially . 228

8.7 The distribution of parameter values in the 65-parameter logistic linear discriminant function
representing the digit “3” . 229

8.8 Test sample empirical error rates with 95% confidence intervals for the DB1 database’s
benchmark split of training/testing examples . 231

8.9 Left: Test sample classification summaries for the 650-parameter logistic linear classifier
employing differential and probabilistic learning. Right: The difference between the
probabilistically-generated models’ empirical error rate and the differentially-generated
model’s rate on a trial-by-trial basis . 231

8.10  The  empirical  error  rates  for  the  650-parameter  logistic  linear  classifier  as  it  learns  the 

benchmark  training  sample  differentially . 233 

8.11 The 650-parameter logistic linear classifier’s output state, as projected onto reduced
discriminator output space, after learning the 600 benchmark training examples differentially . 234

8.12 The empirical error rates (training sample in gray and test sample in black) for the
650-parameter logistic linear classifier as it learns the benchmark training sample
probabilistically (MSE objective function) . 235

8.13 The 650-parameter logistic linear classifier’s output state, as projected onto reduced
discriminator output space, after it attempts to learn the 600 benchmark training examples
probabilistically . 236

8.14 The empirical error rates for the 650-parameter logistic linear classifier as it learns the
benchmark training sample probabilistically (CE objective function) . 237

8.15  The  650-parameter  logistic  linear  classifier’s  output  state  — as  projected  onto  reduced 

discriminator  output  space  —  after  it  attempts  to  learn  the  600  benchmark  training  examples 
probabilistically .  238 

8.16  The  empirical  error  rates  for  the  650-parameter  linear  classifier  as  it  learns  the  benchmark 

training  sample  differentially .  240 

8. 1 7  The  650-parameter  linear  classifier’s  output  state  —  as  projected  onto  reduced  discriminator 

output  space  —  after  learning  the  600  benchmark  training  examples  differentially . 241 

8.18  The  empirical  error  rates  for  the  650-parameter  linear  classifier  as  it  leams  the  benchmark 

training  sample  probabilistically  (MSE  objective  function) .  242 

8.19  The  650-parameter  linear  classifier’s  output  state  — as  projected  onto  the  reduced  dis¬ 

criminator  output  space  —  after  it  attempts  to  learn  the  600  benchmark  training  examples 
probabilistically  (MSE  objective  function) . 243 


8.20 Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning and the MSE form of probabilistic learning (Ap). Right:
The difference between the probabilistically-generated models’ empirical error rate and the
differentially-generated model’s rate on a trial-by-trial basis . 244

8.21 Left: Test sample classification summaries for the 650-parameter modified RBF classifier
employing differential learning ( ) and two forms of probabilistic learning ( Ap ). Right:
The difference between the probabilistically-generated models’ empirical error rate and the
differentially-generated model’s rate on a trial-by-trial basis . 244

8.22 The probability density function for a noisy compressed DB1 image pixel when the probability
of pixel inversion is 0.2 . 253

8.23 Moderately noisy versions of the digits shown in figure 8.5 . 254

8.24 Parameters of the differential logistic linear classifier generated with the first of 25 random
DB1 database training/testing splits with moderate noise . 254

8.25 Left: Test sample classification summaries for the 650-parameter logistic linear classifier
employing differential learning ( Aa ) and two forms of probabilistic learning ( Ap ). Right:
The increase in the discriminant error of the two probabilistically-generated models over the
differentially-generated model on a trial-by-trial basis . 257

8.26 Left: The test sample classification summaries shown in figure 8.25 for the 650-parameter
(64-pixels/digit) logistic linear classifier employing differential and probabilistic learning.
Right: Classification summaries for fifteen human subjects asked to classify the 40 64-pixel
examples shown in figure 8.23. Far Right: Classification summaries for fifteen different
human subjects asked to classify the analogous 40 256-pixel examples shown in figure 8.33 . 257

8.27 Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning ( Aa ) and the MSE form of probabilistic learning ( Ap ). Right: The
increase in the discriminant error of the probabilistic model over the differentially-generated
model on a trial-by-trial basis . 260

8.28 Very noisy versions of the digits shown in figure 8.5 . 262

8.29 Parameters of the differential logistic linear classifier generated with the first of 25 random
DB1 database training/testing splits with strong noise . 262

8.30 Left: Test sample classification summaries for the 650-parameter logistic linear classifier
employing differential learning ( Aa ) and two forms of probabilistic learning ( Ap ). Right:
The increase in the discriminant error of the two probabilistically-generated models over the
differentially-generated model on a trial-by-trial basis . 265

8.31 Left: The test sample classification summaries shown in figure 8.30 for the 650-parameter
(64-pixels/digit) logistic linear classifier employing differential and probabilistic learning.
Right: Classification summaries for fifteen human subjects asked to classify the 40 64-pixel
examples shown in figure 8.28. Far Right: Classification summaries for fifteen different
human subjects asked to classify the analogous 40 256-pixel examples shown in figure 8.34 . 265

8.32 Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning and the MSE form of probabilistic learning. Right: The increase in the
discriminant error of the probabilistic model over the differentially-generated model on a
trial-by-trial basis . 266

8.33 Moderately noisy 256-pixel digits formed by flipping the binary pixels of the original DB1
database with probability 0.1 . 269

8.34 Very noisy 256-pixel digits formed by flipping the binary pixels of the original DB1 database
with probability 0.2 . 269

8.35 Correct labels for the digits in figure 8.23 . 270

8.36 Correct labels for the digits in figure 8.28 . 270

8.37 Correct labels for the digits in figure 8.33 . 270

8.38 Correct labels for the digits in figure 8.34 . 270




9.1 Fourteen 1024-pixel examples of healthy (bottom row) and AVN compromised (top row)
femoral heads . 273

9.2 Left: The parameters of a 1024-pixel logistic linear classifier, obtained by differentially
learning all 125 example images. Right: A histogram of the weights in the left figure . 273

9.3 The images in figure 9.1, linearly compressed to 256 pixels . 275

9.4 Left: The parameters of a 256-pixel logistic linear classifier, obtained by differentially
learning all 125 example images. Right: A histogram of the weights in the left figure . 275

9.5 The 257-parameter differentially-generated logistic linear classifier’s output state, as
projected onto the reduced discriminant continuum, after learning all 125 AVN examples . 276

9.6 The 257-parameter differentially-generated logistic linear classifier’s output state, as
projected onto the reduced discriminant continuum, after a typical learning trial for which
the training sample size is 58 AVN images selected randomly from the pool of 125 total
images . 278

9.7 Left: The differentially-generated logistic linear classifier’s sensitivity and specificity for
the trial depicted in figure 9.6. Right: The associated receiver operator characteristic for
detecting an AVN-compromised femoral head . 279

9.8 The three test images that the differentially-generated logistic linear classifier of figure 9.6
misclassifies . 280

9.9 Left: Test sample classification summaries for the low-complexity logistic linear classifier
employing differential learning and two forms of probabilistic learning. Right: The increase
in the discriminant error of the two probabilistic models over the differentially-generated
model on a trial-by-trial basis . 281

9.10 Test sample classification summaries for the low-complexity (257-parameter)
differentially-generated logistic linear classifier (far left), Manduca, Christy, and Ehman’s [90]
probabilistically-generated linear classifier (middle left), and their best
probabilistically-generated non-linear classifier (middle right). Each of ten human subjects performed one
2-fold cross validation trial (far right); the differentially-generated logistic linear classifier
compares favorably with the humans . 281

10.1 Top Left: Panchromatic image of the civil1 site (1.2 meter resolution). Top Right:
Composite of the multi-spectral data for the civil1 site (8 meter resolution), which the
classifiers interpret . 289

10.2 Top: The DRBF classifier’s interpretation of the civil1 site. Middle: The ground truth for the
civil1 site. Bottom: The maximum-likelihood (ML) classifier’s interpretation of the civil1 site . 290

10.3 Top Left: Panchromatic image of the gaol site (1.2 meter resolution). Top Right: Composite
of the multi-spectral data for the gaol site (8 meter resolution), which the classifiers interpret . 293

10.4 Top: The DRBF classifier’s interpretation of the gaol site. Middle: The ground truth for the
gaol site. Bottom: The maximum-likelihood (ML) classifier’s interpretation of the gaol site . 294

10.5 Interpretation of the White House site . 299

10.6 Interpretation of the Bureau of Engraving site . 300

11.1 A simplified view of efficient autonomous learning . 308

C.1 A box plot for a sample of the random variable x . 322

D.1 Details of the synthetic asymmetric sigmoidal form of the classification figure-of-merit
(CFM) . 329

D.2 Three types of training examples: un-learned examples, transition examples, and learned
examples . 335

D.3 Left: The original logistic sigmoidal form of the CFM objective function for four values of
the steepness parameter β (figure adapted from [55]). Right: The function’s first derivative
with respect to the discriminant differential δ for the same four values of β . 337

D.4 The logistic sigmoidal form of the CFM objective function has a first derivative that decreases
exponentially with increasing steepness parameter β . 339

D.5 The ratio of the differential learning rate for transition examples to that for un-learned
examples with a nominal discriminant differential value of δ = -.1 . 341

D.6 The slope of the synthetic CFM objective function’s lower leg, as a function of the confidence
parameter ψ . 343

D.7 The slope of the synthetic CFM objective function’s transition region, as a function of the
confidence parameter ψ . 343

D.8 Equivalent logistic and synthetic CFM functional forms . 345

D.9 A diagrammatic view of backpropagation with the CFM objective function . 349

E.1 A classifier comprising a single linear discriminant function is equivalent to Rosenblatt’s
perceptron when generated with this modified form of the CFM objective function . 360

M.1 Left: The parameters of a 1024-pixel differentially-generated logistic linear classifier
described in chapter 9. Right: A histogram of the weights in the left figure . 414

M.2 Left: The parameters of the logistic linear classifier shown in figure M.1, generated by
differential learning with weight decay. Right: A histogram of the weights in the left figure . 415

M.3 The moving average filter kernel used for weight smoothing . 417

M.4 Left: The parameters of the logistic linear classifier shown in figure M.1, generated by
differential learning with weight smoothing. Right: A histogram of the weights in the left
figure . 418

M.5 Top: Linear non-invertible compression with a compression ratio of 4:1. Bottom: Linear
non-invertible compression with a compression ratio of 2.25:1 . 419


Chapter  1 


Overview 


1.1  Abstract 

There  is  more  to  learning  stochastic  concepts  for  robust  statistical  pattern  recognition  than  the  learning 
itself:  computational  resources  must  be  allocated  and  information  must  be  obtained.  Therein  lies  the  key 
to  a  learning  strategy  that  is  efficient,  requiring  the  fewest  resources  and  the  least  information  necessary  to 
produce classifiers that generalize well. Probabilistic learning strategies currently used with connectionist
(as well as most traditional) classifiers are often inefficient, requiring high classifier complexity and large
training  sample  sizes  to  ensure  good  generalization.  An  asymptotically  efficient  differential  learning  strategy 
is  set  forth.  It  guarantees  the  best  generalization  allowed  by  the  choice  of  classifier  paradigm  as  long  as  the 
training  sample  size  is  large;  this  guarantee  also  holds  for  small  training  sample  sizes  when  the  classifier  is 
an “improper parametric model”^1 of the data (as it often is). Differential learning requires the classifier with
the  minimum  functional  complexity  necessary  —  under  a  broad  range  of  accepted  complexity  measures  — 
for  Bayesian  (i.e.,  minimum  probability-of-error)  discrimination. 

The theory is demonstrated in several real-world machine learning/pattern recognition tasks associated
with  optical  character  recognition,  medical  diagnosis,  and  airborne  remote  sensing  imagery  interpretation. 
These  applications  focus  on  the  implementation  of  differential  learning  and  illustrate  its  advantages 
and  limitations  in  a  series  of  experiments  that  complement  the  theory.  The  experiments  demonstrate 
that  differentially-generated  classifiers  consistently  generalize  better  than  their  probabilistically-generated 
counterparts across a wide range of real-world learning-and-classification tasks. The discrimination
improvements  range  from  moderate  to  significant,  depending  on  the  statistical  nature  of  the  learning  task  and 
its  relationship  to  the  functional  basis  of  the  classifier  used. 


^1 See definition 3.14.

1.2 Intended Audience

The material in this text is intended for researchers in the areas of statistical pattern recognition and machine
learning.  At  the  very  least  this  includes  researchers  from  the  fields  of  statistics,  electrical  and  computer 
engineering,  and  computer  science.  Within  each  of  these  broad  fields  there  are  disciplines  that  have  generated 
their own culture and technical jargon; the terminology of one culture does not always match that of another,
so  there  are  inherent  problems  associated  with  any  attempt  to  reach  a  wide  audience  with  a  single  message. 
Nevertheless,  we  shall  try. 

To this end we combine elements and notation of estimation theory, statistical pattern recognition,
information  theory,  and  computational  learning  theory;  we  exploit  the  more  expressive  aspects  of  each 
discipline  in  order  to  articulate  our  message  clearly.  We  therefore  employ  a  mixture  of  the  notational 
conventions of [15, 45, 29, 117, 100], among others; appendix A provides a glossary of notation.

Although  we  have  endeavored  to  make  the  material  accessible  to  a  broad  audience,  the  text  assumes 
that  the  reader  has  a  basic  understanding  of  probability  and  statistics  and  a  familiarity  with  the  terminology 
commonly  used  in  the  pattern  recognition  literature  (e.g.,  [29]).  This  terminology  is  sometimes  at  odds  with 
that of other disciplines. The most notable example is the word class: in real analysis, measure theory, and
computational learning theory, the term is synonymous with “set”; in the pattern recognition literature, the
term is synonymous with “concept”. We generally mean “concept” (not “set”) when we use the term
class.  At  the  same  time,  we  use  the  term  hypothesis  class  —  computational  learning  theory  jargon  — 
when  referring  to  the  set  of  all  possible  classifiers  that  might  be  generated  by  a  learning  strategy.  Computer 
scientists  will  note  that  we  use  the  term  “search”  to  denote  “numerical  optimization  procedure”.  Strictly 
speaking,  a  numerical  optimization  procedure  is  a  specific  type  of  search  (i.e.,  one  that  takes  place  over  a 
continuous,  differentiable  function  on  an  uncountable  space  of  independent  variables);  thus,  our  terminology 
is  somewhat  imprecise.  We  have  kept  these  kinds  of  inconsistencies  to  a  minimum,  committing  them  only 
when  we  feel  economy  of  words  and/or  the  historical  precedence  of  a  particular  research  field  warrants. 

1.3 Outline of the Text

Up through the first half of this century, statistical pattern recognition was generally done with so-called
“parametric” models. The parametric model assumes that the feature vector (or attribute vector) has a
particular probabilistic form, so the process of learning is simply one of choosing the set of parameters that
maximizes the likelihood that the observed data could have been generated by the model; logistic regression
is  arguably  the  best-known  example.  Parametric  models  are  generally  simple  paradigms  that  lend  themselves 
to  detailed  analysis.  Their  simplicity  is  appealing  in  terms  of  the  analytical  tractability  it  affords,  but  it 
enforces  a  restrictive  probabilistic  view  of  the  world  that  is  not  always  valid. 
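
The maximum-likelihood view of parametric learning described above can be sketched in a few lines. The example below is only an illustration under stated assumptions: it assumes a scalar feature modeled by a single Gaussian density (a choice made here for brevity, not one taken from the text), and the function names and sample values are hypothetical.

```python
import math


def fit_gaussian_ml(samples):
    """Return the maximum-likelihood mean and variance of a 1-D Gaussian.

    Assumption for illustration: the feature is scalar and Gaussian.
    """
    n = len(samples)
    mean = sum(samples) / n
    # The ML variance estimate divides by n (not n - 1).
    variance = sum((x - mean) ** 2 for x in samples) / n
    return mean, variance


def log_likelihood(samples, mean, variance):
    """Log-likelihood of the samples under a Gaussian with the given parameters."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * variance) - (x - mean) ** 2 / (2.0 * variance)
        for x in samples
    )


# Hypothetical observations of a scalar feature.
samples = [1.0, 2.0, 3.0, 4.0]
ml_mean, ml_variance = fit_gaussian_ml(samples)

# Any other parameter choice yields a lower likelihood for the same data.
best = log_likelihood(samples, ml_mean, ml_variance)
shifted = log_likelihood(samples, ml_mean + 0.5, ml_variance)
```

If the Gaussian assumption is wrong for the data at hand, the fitted parameters still maximize the likelihood within the assumed family, which is precisely the restrictive aspect of parametric models noted above.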

The computer age has brought us the power to explore less restrictive “non-parametric” models, which
make no explicit assumptions regarding the underlying probabilistic nature of the world. Parzen windows,
decision trees, and neural network classifiers are three popular examples of non-parametric models. The
adjective “non-parametric” is ironic (and in our opinion unfortunate) because it falsely implies that such
models have no parameters. Of course all models have parameters to the extent that they encode a definition
of the classifier in some tangible form. This leads us to view all models as parametric and, furthermore,
to make a clear distinction between proper and improper parametric models.^2 A proper parametric model
has a specific probabilistic form; when parameterized correctly, it is an exact expression of the feature
vector’s probabilistic nature. An improper parametric model is not a valid expression of the feature vector’s
probabilistic nature, whether or not such an expression exists.

This  text  is  motivated  by  three  convictions.  The  first  is  that  it  is  not  necessary  to  estimate  probabilities 
in  order  to  perform  robust  statistical  pattern  recognition.  The  second  conviction  is  that  many  real-world 
pattern  recognition  tasks  do  not  have  a  proper  parametric  model;  among  those  that  do,  the  proper  parametric 
model  may  not  be  readily  discernible.  The  third  conviction  is  expressed  in  Occam’s  razor,  the  celebrated 
folk theorem^3 that asserts, “the simplest model of the data is the best one.” Ultimately then, the task of
learning  to  perform  robust  statistical  pattern  recognition  becomes  a  process  of  making  the  most  of  the  least: 
we  strive  to  generate  the  best  classifier  we  can  from  the  simplest  model  that  will  do.  Differential  learning  is 
a  theoretically  defensible  means  of  achieving  this  goal  consistently  —  an  assertion  we  support  with  proofs 
and  illustrative  experiments. 

1.3.1  Summary  of  Findings 

All  of  our  findings  follow  from  two  premises: 

Probabilistic versus discriminative learning: There are at least two approaches to learning stochastic
concepts for statistical pattern recognition: probabilistic and discriminative. Probabilistic learning strategies
seek to learn the a posteriori class probabilities of the feature vector over its domain, whereas discriminative
learning strategies seek only to learn the identity of the most likely class at each point on the feature
vector’s domain (equivalently, discriminative learning strategies seek only to learn the Bayes-optimal class
boundaries on the feature vector’s domain). Both of these strategies can be employed with differentiable
supervised classifiers, which form their input-to-output mappings by adjusting a set of internal parameters via
an iterative search aimed at optimizing a differentiable objective function (or empirical risk measure). The
objective function is a metric that evaluates how well the classifier’s evolving mapping reflects the empirical
relationship between the input patterns of the training sample and their class membership, modeled by the
classifier’s discriminant functions. Error measure objective functions and the classification figure-of-merit
(CFM objective function) [55], both described in chapter 2, induce different kinds of learning. Error
measures induce probabilistic learning, whereas the CFM objective function induces differential learning, a
form of discriminative learning appropriate for the differentiable supervised classifier.

^2 The term “proper model” probably originates with Dawes [26]; see section 3.6.2.

^3 Occam’s razor is formalized in the notion of universal probability (e.g., [21, pg. 160]) and in VC theory [134, 136].
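
The distinction between the two learning goals can be made concrete with a small numerical sketch. Everything in it is an assumption made for illustration: the three feature-vector values, the posterior probabilities, and the function names are hypothetical and are not drawn from the experiments in this text. The sketch shows that a set of discriminant outputs can be a poor estimate of the a posteriori class probabilities and still identify the most likely class at every point.

```python
def most_likely_class(outputs):
    """Index of the largest discriminant output (the classifier's decision)."""
    return max(range(len(outputs)), key=lambda i: outputs[i])


def mean_squared_error(estimates, truths):
    """Average squared difference between estimated and true posteriors."""
    pairs = [
        (e, t)
        for est_row, true_row in zip(estimates, truths)
        for e, t in zip(est_row, true_row)
    ]
    return sum((e - t) ** 2 for e, t in pairs) / len(pairs)


# Hypothetical true a posteriori probabilities for two classes at three points.
true_posteriors = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]

# Hypothetical discriminant outputs: poor probability estimates, correct ranking.
crude_outputs = [[0.55, 0.45], [0.99, 0.01], [0.4, 0.6]]

bayes_decisions = [most_likely_class(p) for p in true_posteriors]
crude_decisions = [most_likely_class(o) for o in crude_outputs]
probability_error = mean_squared_error(crude_outputs, true_posteriors)
```

Here the decisions agree at every point even though the probability estimates are far from the true posteriors; a probabilistic strategy would keep spending classifier resources to reduce that estimation error, whereas a discriminative strategy would regard the task as already learned.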

The efficient classifier and efficient learning: We provide rigorous estimation-theoretic definitions
of the efficient classifier. In simple terms, it consistently exhibits the lowest error rate possible for a
given learning/classification task; no other classifier generalizes better. The relatively efficient classifier is
analogous to the efficient classifier with one qualification: the relatively efficient classifier generalizes better
than any other classifier drawn from a limited set of possibilities. We refer to this limited set of possibilities
as the classifier’s hypothesis class (e.g., [100]). The differentiable supervised classifier can be viewed as a
Bayesian learning paradigm because its discriminator’s initial (or prior) parameterization is transformed to
a posterior parameterization during learning. Given a particular training sample size, a particular choice of
discriminant functions (i.e., hypothesis class), and a particular initial parameterization, the transformation
depends entirely on the learning strategy employed. An efficient learning strategy generates the relatively
efficient classifier described above for both small and large training sample sizes. An asymptotically efficient
learning strategy requires large training sample sizes to guarantee the relatively efficient classifier.

Principal  Theoretical  Findings 
We prove the following:

•  Classifiers that learn by minimizing error measure objective functions (e.g., mean-squared error,
the Kullback-Leibler information distance, a.k.a. “cross entropy” [82, 81], etc.) learn
probabilistically.  Again,  the  classifier  that  learns  probabilistically  attempts  to  learn  the  a  posteriori 
probabilities  of  the  feature  vector  over  its  domain. 

•  Learning  probabilistically  by  minimizing  error  measure  objective  functions  rarely  generates  the 
relatively  efficient  classifier.  As  a  result,  probabilistic  learning  is  usually  inefficient. 

•  Classifiers that learn by maximizing the CFM objective function learn differentially. Again, the
classifier  that  learns  differentially  attempts  to  learn  only  the  most  likely  class  of  the  feature  vector  over 
its  domain. 

•  Learning  differentially  by  maximizing  the  synthetic  CFM  objective  function  described  herein  always 
generates  the  relatively  efficient  classifier  for  large  training  sample  sizes.  As  a  result,  differential 
learning  is  asymptotically  efficient. 

•  Learning differentially by maximizing the synthetic CFM objective function usually generates the
relatively efficient classifier for small training sample sizes as well. As a result, differential learning is
usually efficient.

•  Learning differentially via the CFM objective function requires discriminant functions with the least
functional complexity (e.g., the fewest parameters) necessary for Bayesian (i.e., minimum
probability-of-error) discrimination.

•  Information-theoretic analysis proves that the training sample sizes necessary to guarantee a specified
level of generalization via differential learning are typically orders of magnitude smaller than those
necessary to estimate probabilities with a specified level of precision. This indicates that current
probabilistic extensions of the PAC learning paradigm [133] to stochastic concepts on uncountable
feature vector domains (e.g., [59, 60, 146]) are likely to over-estimate the training sample sizes
necessary for good generalization when the learning objective is merely pattern classification.
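
The two families of objective functions named in the findings above can be contrasted in a short sketch. It rests on loudly stated assumptions: the exact CFM definitions appear in chapter 2 and appendix D and are not reproduced here, so the function `sigmoid_figure_of_merit` below is only a hypothetical stand-in, a logistic sigmoid of a discriminant differential with an assumed steepness parameter. The discriminant differential is assumed here to be the correct class's output minus the largest competing output. All names and values are illustrative.

```python
import math


def mean_squared_error(outputs, targets):
    """Error measure: average squared distance from the target vector."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(outputs)


def cross_entropy(outputs, targets):
    """Error measure: cross entropy against a one-hot target vector."""
    return -sum(t * math.log(o) for o, t in zip(outputs, targets) if t > 0.0)


def discriminant_differential(outputs, correct_class):
    """Assumed definition: correct output minus the largest rival output."""
    rivals = [o for i, o in enumerate(outputs) if i != correct_class]
    return outputs[correct_class] - max(rivals)


def sigmoid_figure_of_merit(delta, steepness=4.0):
    """Hypothetical stand-in for a sigmoidal figure of merit.

    It increases monotonically with the differential, so it rewards ranking
    the correct class first rather than matching target values.
    """
    return 1.0 / (1.0 + math.exp(-steepness * delta))


# Hypothetical three-class discriminator outputs for an example of class 0.
outputs = [0.7, 0.2, 0.1]
target = [1.0, 0.0, 0.0]

delta = discriminant_differential(outputs, correct_class=0)
mse = mean_squared_error(outputs, target)
ce = cross_entropy(outputs, target)
merit = sigmoid_figure_of_merit(delta)
```

The error measures are minimized only when the outputs match the targets exactly, while the sigmoidal stand-in depends on the outputs only through their differential; that difference in what is being rewarded is the point of the sketch, not the particular functional form.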

Experimental  Findings 

We apply differential learning to several real-world machine learning/pattern recognition tasks associated with
optical character recognition, medical diagnosis, and airborne remote sensing imagery interpretation. In each
task the differentially generated classifier generalizes better than its probabilistically generated counterpart.
The  discrimination  improvements  range  from  moderate  to  significant  depending  on  the  statistical  nature  of 
the  learning  task  and  its  relationship  to  the  functional  basis  of  the  classifier  used.  In  general,  differential 
learning  exhibits  the  following  characteristics: 

•  Differential learning allows classifiers with 1/2 to 1/10 the number of parameters used in the best
independently-developed models for each task.

•  The error rates of differentially-generated classifiers are 20% to 50% less than those of the best
independently-developed models.

•  The error rates of differentially-generated classifiers are 30% to 80% less than those of
probabilistically-generated control models.

•  Differentially-generated  classifiers  are  between  two  and  ten  times  more  efficient  than  their 
probabilistically-generated  counterparts. 

1.3.2  Profile  of  the  Chapters 

The preceding findings are organized in three basic parts: theoretical findings are described in part I;
experimental findings are described in part II; supporting material is detailed in the appendices. We profile
the contents of parts I and II in the following sections.

Part I: Theory

Chapter 2: We define the differentiable supervised classifier in terms of its discriminator, the set of
discriminant functions that map the feature vector to a set of possible classifications. We discuss the
Bayes-optimal classifier; we refer to its discriminator as the Bayesian discriminant function (BDF). We
show that the BDF has two fundamental forms that correspond to two fundamental approaches to learning:
probabilistic learning seeks to learn the a posteriori class probabilities of the feature vector over its domain;
differential learning merely seeks to learn the identity of the most likely class over the feature vector’s domain.
We prove that both strategies generate the Bayes-optimal classifier, given a sufficiently large training sample
size and a classifier with sufficient functional complexity to learn the appropriate form of the BDF.^4

Chapter 3: We characterize the classifier as an estimator of the Bayes-optimal classifier. We define the
efficient classifier in terms of three metrics: discriminant bias, discriminant variance, and mean-squared
discriminant  error  (MSDE).  These  metrics  are  shown  to  be  quite  different  from  the  functional  bias,  variance, 
and  mean-squared  error  metrics  that  are  commonly  discussed  in  the  literature.  The  efficient  classifier 
generalizes  better  than  any  other  classifier,  exhibiting  the  lowest  possible  MSDE  for  a  given  training  sample 
size.  The  relatively  efficient  classifier  is  analogous,  with  one  caveat:  the  relatively  efficient  classifier 
generalizes  better  than  any  other  classifier  drawn  from  a  limited  set  of  possibilities  (i.e.,  the  hypothesis 
class).  The  relatively  efficient  classifier  is  identically  the  efficient  classifier  if  the  latter  is  contained  in 
the  hypothesis  class;  otherwise,  the  relatively  efficient  classifier  is  the  best  approximation  to  the  efficient 
classifier  allowed  by  the  hypothesis  class.  An  efficient  learning  strategy  always  generates  the  relatively 
efficient  classifier,  regardless  of  the  training  sample  size.  An  asymptotically  efficient  learning  strategy  always 
generates  the  relatively  efficient  classifier,  but  requires  large  training  sample  sizes  to  do  so.  We  prove  that 
differential  learning  is  asymptotically  efficient  and  that  probabilistic  learning  is  inefficient.  We  then  prove 
that  differential  learning  generates  the  Bayes-optimal  classifier  from  the  hypothesis  class  with  the  minimum 
functional  complexity  necessary  for  the  task.  Probabilistic  learning  generally  requires  a  hypothesis  class  with 
greater  functional  complexity  to  generate  the  Bayes-optimal  classifier.  We  conclude  the  chapter  by  outlining 
the  special  conditions  under  which  probabilistic  learning  is  more  efficient  than  differential  learning.  These 
conditions  can  exist  only  if  the  hypothesis  class  constitutes  a  proper  parametric  model  of  the  feature  vector. 

Chapter 4: We illustrate the proofs of chapter 3 with two simple learning/classification tasks that lend
themselves to closed-form analysis. The first illustration involves a proper parametric model; it shows
the special circumstances under which probabilistic learning generates a more efficient classifier than
differential learning does when the training sample size is small. The second illustration shows the typical
learning/classification scenario wherein the hypothesis class constitutes an improper parametric model. In
this circumstance, differential learning generates the relatively efficient classifier for both small and large
training sample sizes, whereas probabilistic learning generally fails to do so for any training sample size.

^4 We address the issue of classifier complexity in chapter 3.

Chapter  5:  We  discuss  the  CFM  objective  function  in  terms  of  its  monotonicity  and  the  convergence 
properties  it  engenders  in  the  differential  learning  strategy.  Monotonicity  proves  to  be  an  essential 
characteristic  of  any  objective  function  associated  with  an  efficient  learning  strategy.  In  describing  this 
property  we  invoke  a  differential  view  of  the  discriminator’s  output  state  (i.e.,  the  state  of  the  classifier's 
discriminant  functions),  which  we  employ  frequently  in  the  experiments  of  part  II.  The  view  is  a  2-dimensional 
representation  that  is  consistent  with  the  differential  form  of  the  BDF  described  in  chapter  2.  It  leads  to  a 
simple  taxonomy  of  training  examples  and  an  equally  simple  geometric  explanation  of  the  difference  between 
differential  and  probabilistic  learning  strategies.  We  conclude  the  chapter  by  proving  that  differential  learning 
via the synthetic CFM objective function is reasonably fast. That is, the search for parameters that maximize
the  synthetic  CFM  objective  function  converges  in  reasonable  time. 

Chapter  6:  We  make  a  clear  distinction  between  the  probabilistic  information  content  and  the  discriminant 
information  content  of  a  randomly-selected  training  sample.  We  show  that  a  simple  unfair  (or  “rigged”) 
game of dice forms the basis of all learning/statistical pattern recognition tasks. We analyze this game in order
to prove that a random sample’s discriminant information content is always at least as great as its probabilistic
information content. The information-theoretic argument relies on Rissanen’s notion of stochastic complexity
(e.g., [115]) and can be viewed as an extension of the chapter 3 proof that differential learning requires the least
functional complexity necessary to learn the Bayes-optimal classifier. We derive tight distribution-dependent
bounds on the training sample size and (information-theoretic) functional complexity requirements of the
differential and probabilistic learning strategies. We show that the rigged game of dice extends naturally
to  pattern  recognition  tasks  for  which  the  feature  vector  exists  on  a  finite  countable  domain.  A  further 
extension  of  the  paradigm  brings  us  to  the  general  case  in  which  the  feature  vector  exists  on  a  potentially 
infinite  uncountable  domain.  We  conclude  by  discussing  the  limitations  of  the  rigged-die  paradigm  when  it 
is  generalized  to  the  uncountable  feature  vector  space. 

Part II: Applications

Chapter  7:  We  discuss  the  pragmatic  issues  that  must  be  addressed  in  order  to  implement  differential 
learning  successfully.  We  do  this  with  the  aid  of  the  celebrated  Iris  data,  collected  by  E.  Anderson  [3] 
and  subsequently  used  by  R.  A.  Fisher  in  his  seminal  paper  on  linear  discriminants  [34].  Differential 
learning allows a natural partitioning of the training sample into three sub-sets: un-learned examples, learned
examples, and transition examples. The first two categories are self-explanatory; the third category comprises
those examples that are neither un-learned nor learned, but are “in between” those two states. We describe
the nature of confidence as it pertains to differential learning. We show that the differential paradigm
focuses on un-learned examples without dwelling on learned examples. We contrast differential learning
with two forms of probabilistic learning associated with the mean-squared error (MSE) objective function
and the Kullback-Leibler information distance [82, 81]. We view the impact of differential learning on the
classifier’s ability to detect and reject specious classifications. We conclude by demonstrating the efficiency
of differential learning by learning/classifying the Iris data with three substantially different hypothesis classes:
the differentially-generated classifiers consistently generalize better than their probabilistic counterparts, as
predicted by the proofs of part I.


Chapter  8:  We  expand  upon  the  findings  of  chapter  7  with  an  optical  character  recognition  (OCR)  task 
involving  the  AT&T  DBI  handwritten  digit  database.  The  database  has  been  studied  extensively  by  other 
researchers,  so  it  provides  a  good  benchmark  for  evaluating  differential  learning.  We  begin  with  a  description 
of  the  controlled  experimental  protocols  we  use  throughout  the  text  when  comparing  differential  learning 
with probabilistic learning. We show that classifiers generated differentially from three substantially
different hypothesis classes generalize better than their probabilistically-generated counterparts; the disparity
increases significantly when the OCR images are compressed in order to reduce the classifiers’ complexities.
The differentially-generated classifiers’ error rates are typically 2% to 4%; their probabilistically-generated
counterparts’  error  rates  range  from  3%  to  12%.  By  adding  noise  to  the  OCR  images  prior  to  compression,  we 
induce  conditions  under  which  one  particular  choice  of  hypothesis  class  approximates  the  proper  parametric 
model  of  the  noisy  digits.  As  predicted  by  chapter  3,  the  classifier  generated  probabilistically  from  this  proper 
parametric  model  generalizes  better  than  its  differentially-generated  counterpart  for  small  training  samples 
of  very  noisy  digits. 


Chapter  9:  We  repeat  the  experiments  of  Manduca,  Christy,  and  Ehman  [90]  in  which  avascular  necrosis 
(AVN)  of  the  femoral  head  (a  debilitating  hip  joint  disorder)  is  diagnosed  from  magnetic  resonance  images 
(MRIs)  using  neural  network  classifiers.  We  compare  the  diagnostic  accuracy  of  a  simple  differentially- 
generated  classifier  and  two  probabilistically-generated  controls  (including  the  logistic  regression  model) 
with  their  original  results.  When  presented  with  approximately  sixty  training  images  and  subsequently 
evaluated  on  the  same  number  of  test  images,  the  differentially-generated  classifier  discriminates  between 
healthy and AVN-compromised femoral heads with a 5.9% error rate. This error rate is slightly lower than the
7.5% error rate of humans without formal training in radiology, reported in [90]. The differentially-generated
logistic linear classifier generalizes better than the probabilistic controls and the best previous neural network
classifier, a multi-layer perceptron having approximately 24 times the number of parameters (6,164, versus
257  for  our  classifier)  [90]. 


Chapter  10:  We  describe  a  series  of  remote  sensing  experiments  conducted  in  collaboration  with  the 
Digital  Mapping  Laboratory,  School  of  Computer  Science,  Carnegie  Mellon  University.  We  use  a  modified 
RBF  classifier  employing  differential  learning  (DRBF)  to  interpret  multi-spectral  imagery  from  the  Daedalus 
airborne  remote  sensing  system.  The  interpretation  procedure  involves  classifying  individual  image  pixels, 
which  represent  64  square  meters  of  earth  surface  material,  into  eleven  categories  of  natural  and  man-made 
materials  — a  preliminary  step  in  automated  map  generation  and  various  environmental  analysis  tasks. 
The  DRBF  classifier  has  132  parameters  and  exhibits  a  29%  error  rate  on  the  interpretation  task.  The 
maximum-likelihood  (probabilistic)  model  currently  used  for  this  task  has  847  parameters  and  exhibits  a  46% 
error  rate. 

Chapter  11:  We  state  our  contributions  to  the  fields  of  machine  learning  and  statistical  pattern  recognition. 
We  then  discuss  the  philosophical  implications  of  our  research.  We  conclude  with  an  outline  of  future 
research  that  follows  naturally  from  what  we  have  accomplished  to  date. 

Appendices 

Most of the appendices are explanations of issues that are not essential to the main text, but a few appendices
contain  essential  material  worth  mentioning  here.  Appendix  A  provides  a  glossary  of  notation  used  throughout 
the  text.  Terminology  is  explained  throughout  the  text.  The  reader  can  find  references  to  terms  via  the 
index;  boldface  page  numbers  indicate  the  page  on  which  a  term  is  defined  or  explained  most  thoroughly. 
Appendix  D  provides  details  of  the  synthetic  CFM  objective  function,  including  ANSI  C  source  code  for 
the  function  and  its  first  two  derivatives.  This  appendix  also  contains  a  tutorial  explanation  of  how  the 
backpropagation algorithm [119, 120] can be modified for use with CFM. Appendix E explores the similarities
and differences between differential learning via the CFM objective function and learning via Rosenblatt’s
perceptron criterion function [116]. The reader familiar with learning via the perceptron criterion function
might  find  this  material  a  helpful  introduction  to  differential  learning  via  the  CFM  objective  function  since 
the latter can be viewed as a generalization of the former.

Chapter 2


Probabilistic and Differential Strategies
for Learning the Bayesian Discriminant
Function^1


Outline 

We  describe  two  learning  strategies  by  which  a  broad  category  of  pattern  classifiers  (including,  but  not 
limited to, multi-layer perceptron and radial basis function neural networks) can learn to perform Bayesian
discrimination (i.e., minimum-error statistical pattern recognition). The probabilistic learning strategy is
associated with error measure objective functions such as mean squared error and the Kullback-Leibler
information distance; it engenders Bayesian discrimination by estimating probabilities. The differential
learning strategy is associated with classification figure-of-merit objective functions [55]; because it is a
discriminative  strategy,  it  engenders  Bayesian  discrimination  without  estimating  probabilities  directly.  We 
describe  each  strategy  in  detail  as  a  preliminary  step  in  proving  that  differential  learning  is  efficient,  whereas 
probabilistic  learning  is  not. 

2.1  Introduction 

This  chapter  describes  two  supervised  strategies  for  learning  stochastic  concepts  in  order  to  perform  statistical 
pattern  recognition.  Each  of  these  strategies  can  be  applied  to  any  computational  model  (hereafter  called  the 
classifier)  that  forms  an  input-to-output  mapping  by  adjusting  a  set  of  internal  parameters  via  an  iterative 
search  aimed  at  optimizing  an  objective  function  (or  empirical  risk  measure).  The  objective  function  is  a 
metric  that  evaluates  how  well  the  classifier’s  evolving  mapping  reflects  the  empirical  relationship  between 
the  input  patterns  of  the  training  sample  and  their  class  membership,  modeled  by  the  classifier’s  outputs. 
Optimizing the objective function via iterative search on the classifier’s parameter space is therefore a
mathematically defensible approach to machine learning.

^1 This chapter is a revised and extended version of work first published in [54].

Our  principal  objective  in  this  chapter  is  to  describe  the  specific  nature  of  two  supervised  learning  strategies 
—  probabilistic  learning  and  differential  learning  —  that  lead  to  minimum-error  Bayesian  discrimination 
when  applied  to  classifiers  of  the  form  described  above.  Because  many  neural  networks  learn  in  a  supervised 
fashion,  it  is  our  hope  that  the  following  proofs  will  be  of  general  interest  to  the  connectionist  community. 
Our  presentation  begins  with  an  overview  of  Bayesian  discrimination,  which  provides  the  framework  upon 
which  the  main  proofs  are  built.  Our  secondary  objective  is  to  show  that  these  proofs  apply  to  a  broad 
spectrum  of  machine  learning  paradigms  —  not  all  of  which  are  typically  associated  with  statistical  pattern 
recognition.  We  do  this  by  introducing  them  within  an  historical  context  that  views  connectionist  models  as 
natural  extensions  of  more  traditional  classifier  paradigms. 

In  this  chapter  and  all  that  follow,  we  combine  the  elements  and  notation  of  statistical  pattern  recognition 
and  computational  learning  theory  in  order  to  exploit  the  more  expressive  aspects  of  each  discipline  and 
present a succinct set of proofs. We employ a mixture of the notational conventions of [45, 29, 117, 100];
appendix  A  provides  a  glossary  of  notation. 

We  define  and  contrast  probabilistic  and  differential  learning  in  terms  of  the  functional  forms  of  the 
Bayesian discriminant function (e.g., [29, sec. 2.5]) they generate. Simply stated, probabilistic learning
yields  classifiers  that  estimate  the  class  probabilities  for  a  given  input  pattern,  whereas  differential  learning 
yields  classifiers  that  merely  identify  the  most  likely  class  for  a  given  input  pattern.  Each  learning  strategy 
is  associated  with  a  family  of  objective  functions.  Proofs  of  varying  generality  and  rigor  linking  specific 
objective  functions  to  what  we  call  probabilistic  learning  are  not  new.  Many  authors  have  shown  this  linkage 
for the mean-squared-error (MSE) objective function^2 [103, 29, 7, 142, 86, 42, 118, 17, 138, 125, 70], while
others have shown it for the Kullback-Leibler information distance (a.k.a. “cross entropy”: CE) and/or
closely related error measures [82, 81, 67, 11, 127, 142, 64, 42, 31]. Simulations demonstrating the validity
of those proofs for neural network classifiers can be found in [111]. We prove that both of these objective
functions  belong  to  a  broad  family  of  error  (or  distance)  measures  that  engender  probabilistic  learning.  We 
then  prove  that  classification  figure-of-merit  (CFM)  objective  functions  [55]  engender  differential  learning. 

2.2  Bayesian  Discrimination 

Consider a random vector (RV) $\mathbf{X}$, which exists on the domain of real-valued $N$-dimensional vectors:
$\mathbf{X} \in \mathcal{X} = \mathbb{R}^N$.^3 $\mathbf{X}$ is a feature vector (or attribute vector), which can represent any one of $C$ classes (or
concepts) on feature vector space $\mathcal{X}$. Feature vector space $\mathcal{X}$ is paired with a class label (or classification)
space $\Omega = \{\omega_1, \ldots, \omega_C\}$. We denote the $i$th class of $\mathbf{X}$ by $\omega_i$. The feature vector $\mathbf{X}$ is always
paired with a class label $W \in \Omega$; we denote the pairing by $(\mathbf{X}, W)$. Because the classes defined on $\mathcal{X}$
are stochastic, $W$ for a given $\mathbf{X}$ is not deterministic; rather it is a random variable, examples of which
are generated according to the conditional distribution $P_{W|x}(W \mid \mathbf{X})$ over $\Omega$. Thus, the probability that an
example of $\mathbf{X}$ constitutes an example of the $i$th class $\omega_i$ is $P_{W|x}(\omega_i \mid \mathbf{X})$. We refer to $P_{W|x}(\omega_i \mid \mathbf{X})$ as
the “a posteriori probability of the $i$th class (given $\mathbf{X}$).”

^2 The MSE objective function is also called the least-mean-squared (LMS) and the least-squared error (LSE) objective function.

^3 This uncountably infinite definition of feature vector space includes more restrictive definitions, such as the set of all integer-valued
$N$-dimensional vectors and the set of all $N$-tuples $\{0, 1\}^N$. The following proofs therefore apply to more restrictive definitions of
the feature vector, involving finite and/or countable spaces, without loss of generality.

Figure 2.1: Left: The products of class-conditional density and class prior probability, $\rho_{x|W}(x \mid \omega_i) \cdot P_W(\omega_i)$,
for a three-class random scalar $x$. Right: The associated a posteriori probabilities $P_{W|x}(\omega_i \mid x)$ for each of
the three classes, plotted over the effective domain of $x$. These constitute the strictly probabilistic form of
the Bayesian discriminant function for $x$ (see definition 2.3).

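There exists some means of obtaining examples of $\mathbf{X}$, which are generated according to the probability
density function (pdf) $\rho_x(\mathbf{X})$ over $\mathcal{X}$. Examples of $\mathbf{X}$ representing the $i$th class are generated according
to the class-conditional pdf $\rho_{x|W}(\mathbf{X} \mid \omega_i)$ over $\mathcal{X}$ such that

$$\rho_x(\mathbf{X}) = \sum_{j=1}^{C} \rho_{x|W}(\mathbf{X} \mid \omega_j) \cdot P_W(\omega_j) \tag{2.1}$$

The “prior” distribution of classes on $\Omega$, denoted by $P_W(W)$ in (2.1), is obtained by integrating the joint
pdf $\rho_{x,W}(\mathbf{X}, W)$, which is defined over the joint space $\mathcal{X} \times \Omega$, over $\mathcal{X}$:

$$P_W(W) = \int_{\mathcal{X}} \rho_{x,W}(\mathbf{X}, W) \, d\mathbf{X} \tag{2.2}$$

By Bayes’ rule, the a posteriori probability of the $i$th class is

$$P_{W|x}(\omega_i \mid \mathbf{X}) = \frac{\rho_{x,W}(\mathbf{X}, \omega_i)}{\rho_x(\mathbf{X})} = \frac{\rho_{x|W}(\mathbf{X} \mid \omega_i) \cdot P_W(\omega_i)}{\sum_{j=1}^{C} \rho_{x|W}(\mathbf{X} \mid \omega_j) \cdot P_W(\omega_j)} \tag{2.3}$$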

Example 2.1 Figure 2.1 provides a concrete example of a simple three-class pattern recognition task
described by the mathematical formalism above. In this figure the feature vector is actually a scalar, so
feature vector space is the real number line: $x \in \mathcal{X} = \mathbb{R}$. This random variable can represent one of
three classes on classification space: $W \in \Omega = \{\omega_1, \omega_2, \omega_3\}$. The left-hand side of the figure shows the
class-conditional pdfs of $x$ multiplied by the class prior probabilities ($\rho_{x|W}(x \mid \omega_i) \cdot P_W(\omega_i)$; $i = 1, 2, 3$)
in order of increasing class index, from bottom left to top right. The right-hand side of the figure shows the
corresponding a posteriori class probabilities ($P_{W|x}(\omega_i \mid x)$; $i = 1, 2, 3$) over the effective domain of $x$.
The bar-graph display at the bottom of the left and right figures is described in example 2.3.
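
As a rough numerical companion to Example 2.1, the sketch below evaluates the mixture density (2.1) and Bayes’ rule for a three-class scalar task. The Gaussian class-conditional densities, their means and standard deviations, and the priors are illustrative assumptions only; the densities actually plotted in figure 2.1 are not specified in the text, and the function names are hypothetical.

```python
import math

# Hypothetical three-class scalar task (NOT the densities of figure 2.1).
CLASS_MEANS = (-4.0, 0.0, 4.0)
CLASS_STDS = (2.0, 2.0, 2.0)
PRIORS = (0.3, 0.4, 0.3)  # assumed P_W(w_i); they sum to 1


def gaussian_pdf(x, mean, std):
    """Class-conditional density rho_{x|W}(x | w_i), assumed Gaussian here."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def joint_terms(x):
    """Products rho_{x|W}(x | w_i) * P_W(w_i), one per class."""
    return [gaussian_pdf(x, m, s) * p
            for m, s, p in zip(CLASS_MEANS, CLASS_STDS, PRIORS)]


def posteriors(x):
    """A posteriori probabilities P_{W|x}(w_i | x) via Bayes' rule."""
    joint = joint_terms(x)
    evidence = sum(joint)  # rho_x(x), the mixture density of (2.1)
    return [j / evidence for j in joint]


if __name__ == "__main__":
    for x in (-5.0, 0.0, 5.0):
        print(x, [round(p, 3) for p in posteriors(x)])
```

Whatever densities are assumed, the posteriors at any given $x$ sum to one, because the denominator of Bayes’ rule is exactly the sum of the numerators.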


2.2.1 The Classifier and the Bayesian Discriminant Function

Pattern recognition, discrimination, or classification is the process by which the classifier associates a class
(or concept) label with each example of the feature vector presented to it. For this reason, the classifier
implements a set of $C$ deterministic functional mappings (known as the discriminant functions) from feature
vector space $\mathcal{X}$ to discriminator output space $\mathcal{Y}$. This set of discriminant functions $\mathcal{G}(\mathbf{X} \mid \theta)$ is known
collectively as the discriminator. The discriminator output $\mathbf{Y} = (y_1, \ldots, y_C)$ exists on the domain of
real-valued $C$-dimensional vectors^5 ($\mathbf{Y} \in \mathcal{Y} = \mathbb{R}^C$) such that

$$\mathcal{G}: \mathcal{X} \to \mathcal{Y}; \qquad \mathcal{G}(\mathbf{X} \mid \theta) \triangleq \{ g_1(\mathbf{X} \mid \theta), \ldots, g_C(\mathbf{X} \mid \theta) \} \tag{2.4}$$

where

$$g_i: \mathcal{X} \to y_i \quad (y_i \in \mathbb{R}) \quad \forall i \tag{2.5}$$

The argument $\theta$ in $\mathcal{G}(\mathbf{X} \mid \theta)$ and $g_i(\mathbf{X} \mid \theta)$ indicates that the discriminant functional mappings depend on
the parameterization $\theta$ of the discriminator.


^4 The following definition of a classifier and its associated functional mappings assumes a one-to-one correspondence between the
number of discriminator outputs and the number of classes $C$. The proofs that follow rely on this conventional (rather than necessary)
assumption. Other assumptions are equally valid. As an example, a classifier with $\lceil \log_2 C \rceil$ discriminator outputs can recognize $C$
classes. In such a case the following proofs will hold, given appropriate modifications to account for the altered discriminator output
space $\mathcal{Y}$.

^5 As with feature vector space, this uncountably infinite definition of discriminator output space includes more restrictive definitions
involving countable and/or finite spaces. Examples of such spaces are the finite uncountable space $\mathcal{Y} = (0, 1)^C$ associated with
multi-layer perceptrons having output nodes with logistic (i.e., differentiable sigmoidal) non-linearities, and the finite countable space
$\mathcal{Y} = \{0, 1\}^C$ associated with decision trees.

Figure 2.2: A diagrammatic view of the classifier and its associated functional mappings. The classifier
input is a feature vector $\mathbf{X}$; the $C$ discriminator outputs $y_1, \ldots, y_C$ correspond to the classes that $\mathbf{X}$ can
represent; the class label $\mathcal{D}(\mathbf{X} \mid \theta)$ assigned to the input feature vector corresponds to the discriminator’s
largest output. Figure based on figure 2.3 of Duda & Hart [29].


The final classification $\mathcal{D}(\mathbf{X} \mid \theta)$ of $\mathbf{X}$ is obtained by a mapping $\Gamma$ from discriminator output space
$\mathcal{Y}$ to classification space $\Omega$. This mapping associates $\mathbf{X}$ with the class label corresponding to the
discriminator’s largest output.^6 If two or more discriminator outputs are equal and larger than the rest, the
mapping yields a set of possible classifications, corresponding to all of the top outputs:

$$\Gamma: \mathcal{Y} \to \Omega; \qquad \Gamma(\mathbf{Y}) = \begin{cases} \omega_i : y_i = \max_j y_j, & y_k < y_i \ \ \forall k \neq i \\ \{ \omega_i : y_i = \max_j y_j \}, & \text{otherwise} \end{cases} \tag{2.6}$$

$$\text{s.t.} \quad \mathcal{D}(\mathbf{X} \mid \theta) = \Gamma(\mathbf{Y}) = \Gamma(\mathcal{G}(\mathbf{X} \mid \theta)); \qquad \mathcal{D}(\mathbf{X} \mid \theta) \in \Omega = \{ \omega_1, \ldots, \omega_C \} \tag{2.7}$$

In the words of Duda and Hart [29, sec. 2.5.1], “the classifier is viewed as a machine that computes $C$
discriminant functions and selects the [class] corresponding to the largest discriminant.” Figure 2.2 is based
on figure 2.3 of [29], and illustrates this mathematical notion of the classifier.
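
The mapping from discriminator outputs to class labels, including the tie case of (2.6), can be sketched in a few lines. This is a minimal illustration, not the report’s implementation; the function name `classify` and the label strings are hypothetical.

```python
def classify(outputs, labels):
    """Map discriminator outputs Y to the set of top-ranked class labels.

    Returns a one-element set when a single output is strictly largest,
    and the set of all tied top labels otherwise, mirroring (2.6).
    """
    if len(outputs) != len(labels):
        raise ValueError("need exactly one discriminator output per class")
    top = max(outputs)
    return {label for y, label in zip(outputs, labels) if y == top}


if __name__ == "__main__":
    names = ["w1", "w2", "w3"]
    print(classify([0.2, 0.7, 0.1], names))  # single winner
    print(classify([0.5, 0.5, 0.1], names))  # tie between two classes
```

Returning a set in both cases keeps the caller’s handling uniform; a practical classifier would still need a tie-breaking rule, which the text leaves open.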

Definition 2.1 The (minimum-error) Bayes-optimal classifier: It is straightforward to prove that the
classifier $\mathcal{D}(\mathbf{X})$ that minimizes the probability of an incorrect classification of $\mathbf{X}$ is the one that always
maps the feature vector to its most probable class (e.g., [29, pp. 16-20]):

$$\mathcal{D}(\mathbf{X})_{\text{Bayes}} = \omega_* : \ P_{W|x}(\omega_* \mid \mathbf{X}) > P_{W|x}(\omega_k \mid \mathbf{X}) \quad \forall \omega_k \neq \omega_* \tag{2.8}$$

Any classifier satisfying (2.8) is known as a Bayes-optimal classifier, which is said to yield (minimum-error)
Bayesian discrimination.

^6 The following proofs are necessarily linked with this method of choosing the class label for an example. For classifiers that do not
have a one-to-one correspondence between their number of discriminator outputs and the number of classes, the discriminator output
state representing the classification of the example must be maximal by a measure that is appropriate for the discriminator.

Remark: Equation (2.8) describes a unique mapping from $\mathbf{X}$ to $W$. However, given our definition
of the classifier in (2.6) and (2.7), there are infinitely many discriminators $\mathcal{G}(\mathbf{X} \mid \theta)$ that implement the
Bayes-optimal classifier. Indeed, as long as the discriminant function $g_*(\mathbf{X} \mid \theta)$ associated with the largest
a posteriori class probability $P_{W|x}(\omega_* \mid \mathbf{X})$ in (2.8) is always largest, the classifier yields Bayesian
discrimination.

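Definition 2.2 The Bayesian discriminant function (BDF): In mathematically formal terms, the
discriminator $\mathcal{G}(\mathbf{X} \mid \theta)$ constitutes the Bayesian discriminant function, and the classifier $\mathcal{D}(\mathbf{X} \mid \theta)$
yields Bayesian discrimination, if the classifier’s largest output always corresponds to the most likely
class, given $\mathbf{X}$:

$$\mathcal{D}(\mathbf{X} \mid \theta) = \Gamma(\mathbf{Y}) = \Gamma(\mathcal{G}(\mathbf{X} \mid \theta)) = \mathcal{D}(\mathbf{X})_{\text{Bayes}} \quad \text{iff} \quad g_i(\mathbf{X} \mid \theta) = y_i;$$

$$\begin{cases}
y_i > y_j \ \text{ where } \ P_{W|x}(\omega_i \mid \mathbf{X}) > P_{W|x}(\omega_j \mid \mathbf{X}), & P_{W|x}(\omega_i \mid \mathbf{X}) = \max_k P_{W|x}(\omega_k \mid \mathbf{X}) \\
y_i = y_j \ \text{ where } \ P_{W|x}(\omega_i \mid \mathbf{X}) = P_{W|x}(\omega_j \mid \mathbf{X}), & P_{W|x}(\omega_i \mid \mathbf{X}) = \max_k P_{W|x}(\omega_k \mid \mathbf{X}) \\
y_i < \max_j y_j, & P_{W|x}(\omega_i \mid \mathbf{X}) < \max_j P_{W|x}(\omega_j \mid \mathbf{X})
\end{cases}
\qquad \forall i, \ \forall \mathbf{X} \in \mathcal{X} \tag{2.9}$$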


Remark:  The  BDF  is  of  course  a  set  of  functions  rather  than  a  single  function,  as  the  name  suggests.  Because 
there  are  infinitely  many  discriminators  that  satisfy  (2.9),  there  are  infinitely  many  discriminant  functions 
that  implement  the  Bayes-optimal  classifier  of  definition  2.1. 

2.2.2  Probabilistic  and  Differential  forms  of  the  Bayesian  Discriminant  Function 

We group all Bayesian discriminant functions into four categories. Note that the following definitions
consider all possible forms of the BDF, not just those allowed by our choice of discriminator $\mathcal{G}(\mathbf{X} \mid \theta)$
(we discuss the difference at length in chapter 3). We denote the arbitrary BDF by $\mathcal{F}(\mathbf{X})_{\text{Bayes}}$ and the set of
all such BDFs by $\mathbb{F}_{\text{Bayes}}$.

Definition 2.3 The strictly probabilistic form of the BDF: This form of the BDF is given by

$$\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Probabilistic}} = \{ f_1(\mathbf{X}), \ldots, f_C(\mathbf{X}) \}; \qquad f_i(\mathbf{X}) = P_{W|x}(\omega_i \mid \mathbf{X}) \quad \forall i \tag{2.10}$$

Remark: There is only one $\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Probabilistic}}$, which is uniquely specified by the a posteriori class
probabilities of $\mathbf{X}$.

Definition 2.4 The probabilistic form of the BDF: This form of the BDF is given by

$$\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}} = \{ f_1(\mathbf{X}), \ldots, f_C(\mathbf{X}) \}; \qquad f_i(\mathbf{X}) = \varphi\left( P_{W|x}(\omega_i \mid \mathbf{X}) \right) \quad \forall i \tag{2.11}$$

where $\varphi(\cdot)$ is a strictly increasing function of its argument:

$$\frac{d}{dz} \varphi(z) > 0, \qquad 0 < z < 1 \tag{2.12}$$

We denote the set of all probabilistic forms of the BDF by $\mathbb{F}_{\text{Bayes-Probabilistic}}$ (i.e., every $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}}$
is a member of $\mathbb{F}_{\text{Bayes-Probabilistic}}$).

Remark: There are innumerable probabilistic forms of the BDF, since there are innumerable strictly
increasing functions $\varphi(\cdot)$.
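
The practical content of definition 2.4 is that any strictly increasing $\varphi$ leaves the ranking of the classes unchanged. The sketch below checks this numerically for three choices of $\varphi$ (the identity, an affine map and the logarithm). The posterior values and all names are illustrative assumptions, not data from the report.

```python
import math


def ranking(values):
    """Class indices ordered from largest to smallest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Hypothetical a posteriori class probabilities for one input pattern.
posterior = [0.2, 0.7, 0.1]

# Three strictly increasing transformations phi on (0, 1).
transforms = {
    "identity": lambda z: z,
    "affine": lambda z: 3.0 * z + 1.0,
    "log": math.log,
}

# Ranking of the transformed discriminant functions f_i = phi(P(w_i | X)).
rankings = {name: ranking([phi(p) for p in posterior])
            for name, phi in transforms.items()}

if __name__ == "__main__":
    for name, order in rankings.items():
        print(name, order)
```

All three rankings coincide with the ranking of the posteriors themselves, which is why every probabilistic form selects the same class as the strictly probabilistic form.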

Example 2.2 Consider the following four probabilistic forms of the BDF:

• $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}} = \mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Probabilistic}} = \{ f_i(\mathbf{X}) = P_{W|x}(\omega_i \mid \mathbf{X}); \ i = 1, \ldots, C \}$

• $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}} = \{ f_i(\mathbf{X}) = 3 \cdot P_{W|x}(\omega_i \mid \mathbf{X}) + 1; \ i = 1, \ldots, C \}$

• $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}} = \{ f_i(\mathbf{X}) = \log\left( P_{W|x}(\omega_i \mid \mathbf{X}) \right); \ i = 1, \ldots, C \}$

• $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}} = \{ f_i(\mathbf{X}) = \left( P_{W|x}(\omega_i \mid \mathbf{X}) \right)^{\ldots}; \ i = 1, \ldots, C \}$

All of these discriminant functions preserve the rankings of the a posteriori class probabilities of $\mathbf{X}$ via a
strictly increasing transformation $P_{W|x}(\omega_i \mid \mathbf{X}) \to f_i(\mathbf{X})$. Note that the first one is the strictly probabilistic
form of the BDF $\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Probabilistic}}$.

Figure 2.3: The a posteriori class differentials $\Delta_{W|x}(\omega_i \mid x)$ for the three-class random variable $x$ depicted
in figure 2.1, plotted over the effective domain of $x$. These constitute the strictly differential form of the
Bayesian discriminant function for $x$ (see definition 2.5).

Although $\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Probabilistic}}$ and $\mathcal{F}(\mathbf{X})_{\text{Bayes-Probabilistic}}$ are the most obvious forms of the BDF,
other forms exist. One that satisfies (2.9) is manifest in any discriminant function with a
top-ranked member function $f_*(\mathbf{X})$ corresponding to the largest a posteriori class probability $P_{W|x}(\omega_* \mid \mathbf{X})$
in (2.8). The two categories of this differential form of the BDF are analogous to the two probabilistic
categories defined above.

Definition 2.5 The strictly differential form of the BDF: This form of the BDF is manifest in any
discriminant function $\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Differential}}$ with the following property: the difference between the $i$th
function $f_i(\mathbf{X})$ and the $k$th function $f_k(\mathbf{X})$ is equal to the difference between their corresponding a posteriori
probabilities. For each $f_i(\mathbf{X})$, $f_k(\mathbf{X})$ is not chosen arbitrarily; rather it corresponds to the a posteriori
probability $P_{W|x}(\omega_k \mid \mathbf{X}) = \max_{j \neq i} P_{W|x}(\omega_j \mid \mathbf{X})$. Mathematically,

$$\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Differential}} = \{ f_1(\mathbf{X}), \ldots, f_C(\mathbf{X}) \} \quad \text{s.t.} \ \forall i,$$

$$f_i(\mathbf{X}) - f_k(\mathbf{X}) = \underbrace{P_{W|x}(\omega_i \mid \mathbf{X}) - P_{W|x}(\omega_k \mid \mathbf{X})}_{\Delta_{W|x}(\omega_i \mid \mathbf{X})} \ : \quad P_{W|x}(\omega_k \mid \mathbf{X}) = \max_{j \neq i} P_{W|x}(\omega_j \mid \mathbf{X}) \tag{2.13}$$

After some reflection it should be clear that the necessary and sufficient condition for the discriminant function
to be a strictly differential form of the BDF is that its member functions be related to their corresponding a
posteriori class probabilities by a constant $\kappa$:

$$\mathcal{F}(\mathbf{X}) = \mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Differential}} \quad \text{iff} \quad f_i(\mathbf{X}) = P_{W|x}(\omega_i \mid \mathbf{X}) + \kappa \quad \forall i \tag{2.14}$$

We denote the set of all strictly differential forms of the BDF by $\mathbb{F}_{\text{Bayes-Strictly Differential}}$.

Remark: Note that, by this and the preceding definition, $\mathbb{F}_{\text{Bayes-Strictly Differential}} \subset \mathbb{F}_{\text{Bayes-Probabilistic}}$. For this
reason, $\mathcal{F}(\mathbf{X})_{\text{Bayes-Strictly Differential}}$ reflects the rankings of all a posteriori class probabilities in the rankings
of its discriminant functions. We refer to $\Delta_{W|x}(\omega_i \mid \mathbf{X})$ in (2.13) as the “(a posteriori) differential of the
$i$th class (given $\mathbf{X}$).”

Example 2.3 Figure 2.3 illustrates the a posteriori differentials $\{ \Delta_{W|x}(\omega_1 \mid x), \ldots, \Delta_{W|x}(\omega_3 \mid x) \}$ that
correspond to the three-class random variable depicted in figure 2.1. A review of these two figures and
definition 2.5 reveals that the Bayes-optimal class label $\omega_*$ in (2.8) is $\omega_i$ for all patterns that elicit a
non-negative $i$th class differential $\Delta_{W|x}(\omega_i \mid x)$:

$$\omega_* = \omega_i \quad \text{iff} \quad \Delta_{W|x}(\omega_i \mid x) \geq 0 \tag{2.15}$$

The bar-graph displays at the bottom of figures 2.1 and 2.3 denote the most probable class $\omega_*$ for $x$ over
its effective domain. The bar-graphs also mark the class boundaries at $x = \pm 2.7$.

Note that the class boundaries on $\mathcal{X}$ are indicated by the absence of a positive differential; only zero
and negative differentials exist at the boundaries (again, for example, $x = \pm 2.7$ in figure 2.3). In such
cases, the Bayes-optimal class label for a boundary value of $\mathbf{X}$, denoted by $\mathbf{X}_{\text{Boundary}}$, is any one of
the classes with a corresponding a posteriori differential of zero:

$$\forall \, \mathbf{X}_{\text{Boundary}} \in \mathcal{X}, \quad \omega_* \in \{ \omega_i : \Delta_{W|x}(\omega_i \mid \mathbf{X}_{\text{Boundary}}) = 0 \} \tag{2.16}$$
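
The relationship between a posteriori differentials and the Bayes-optimal label in (2.15) and (2.16) can be sketched as follows. The posterior values are hypothetical (they are not read off figure 2.3), and the function names are assumptions made for illustration.

```python
def class_differentials(posterior):
    """Delta(w_i | X) = P(w_i | X) - max_{j != i} P(w_j | X) for each class i."""
    if len(posterior) < 2:
        raise ValueError("need at least two classes")
    result = []
    for i, p_i in enumerate(posterior):
        largest_other = max(p for j, p in enumerate(posterior) if j != i)
        result.append(p_i - largest_other)
    return result


def bayes_optimal_classes(posterior):
    """Indices of classes with a non-negative differential, as in (2.15)-(2.16)."""
    return [i for i, d in enumerate(class_differentials(posterior)) if d >= 0.0]


if __name__ == "__main__":
    interior = [0.2, 0.7, 0.1]   # hypothetical point away from a class boundary
    boundary = [0.4, 0.4, 0.2]   # hypothetical point on a class boundary
    print(class_differentials(interior), bayes_optimal_classes(interior))
    print(class_differentials(boundary), bayes_optimal_classes(boundary))
```

Away from a boundary exactly one differential is positive; on a boundary the tied classes share a differential of zero and no differential is positive, which matches the description of figure 2.3.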


Definition 2.6 The differential form of the BDF: This form of the BDF is manifest in any discriminant
function $\mathcal{F}(\mathbf{X})_{\text{Bayes-Differential}}$ with a top-ranked member function $f_*(\mathbf{X})$ corresponding to the largest a
posteriori class probability $P_{W|x}(\omega_* \mid \mathbf{X})$ in (2.8). Mathematically,

$$\mathcal{F}(\mathbf{X})_{\text{Bayes-Differential}} = \{ f_1(\mathbf{X}), \ldots, f_C(\mathbf{X}) \} \quad \text{s.t.} \ \forall i,$$

$$\operatorname{sign}\Big[ \underbrace{f_i(\mathbf{X}) - \max_{k \neq i} f_k(\mathbf{X})}_{\delta_i(\mathbf{X})} \Big] = \operatorname{sign}\Big[ \underbrace{P_{W|x}(\omega_i \mid \mathbf{X}) - \max_{k \neq i} P_{W|x}(\omega_k \mid \mathbf{X})}_{\Delta_{W|x}(\omega_i \mid \mathbf{X})} \Big] \tag{2.17}$$

We denote the set of all differential forms of the BDF by $\mathbb{F}_{\text{Bayes-Differential}}$.

Definition 2.7 The discriminant differential: We refer to $\delta_i(\mathbf{X})$ in (2.17) as the “$i$th discriminant
differential”, that is, the difference between the $i$th discriminant function and the largest other discriminant
function. We use the notation

$$\delta_i(\mathbf{X} \mid \theta) = g_i(\mathbf{X} \mid \theta) - \max_{k \neq i} g_k(\mathbf{X} \mid \theta) \tag{2.18}$$

when referring to the $i$th discriminant differential of the classifier with the discriminator $\mathcal{G}(\mathbf{X} \mid \theta) =
\{ g_1(\mathbf{X} \mid \theta), \ldots, g_C(\mathbf{X} \mid \theta) \}$. In this context, the discriminant differential is the difference between the
classifier’s $i$th output and the largest other output.

Remark: Note that $\mathcal{F}(\mathbf{X})_{\text{Bayes-Differential}}$ accurately reflects the ranking of only the largest a posteriori class
probability in the rankings of its discriminant functions and their associated discriminant differentials. We
reiterate that the only condition necessary for the discriminant function to be a differential form of the BDF
is that its top-ranked member function $f_*(\mathbf{X})$ always correspond to the largest a posteriori class probability
$P_{W|x}(\omega_* \mid \mathbf{X})$ in (2.8). This is precisely the necessary and sufficient condition for Bayesian discrimination
given in (2.9), which leads us to the following theorem:

Theorem 2.1 The differential form of the Bayesian discriminant function $\mathcal{F}(X)_{\text{Bayes-Differential}}$ is the most general; the set of all such forms $\mathbb{F}_{\text{Bayes-Differential}}$ is equal to the set of all Bayesian discriminant functions $\mathbb{F}_{\text{Bayes}}$ by the following relationship:

\[ \mathcal{F}(X)_{\text{Bayes-Strictly Probabilistic}} \in \mathbb{F}_{\text{Bayes-Strictly Differential}} \subset \mathbb{F}_{\text{Bayes-Probabilistic}} \subset \mathbb{F}_{\text{Bayes-Differential}} = \mathbb{F}_{\text{Bayes}} \tag{2.19} \]


Proof  :  The  proof  follows  from  definitions  2.6  —  2. 1 1 


Example 2.4 In order to clarify the notion of a differential form of the BDF, let us return to the three-class random variable $x$ depicted in figures 2.1 and 2.3, which we replicate in figure 2.4.

On the left side of the figure we show the a posteriori class probabilities of $x$; superimposed on these are the discriminant functions

\[ \begin{aligned} f_1(x) &= -0.0376\,x + 0.3985 \\ f_2(x) &= 0.5 \\ f_3(x) &= 0.0376\,x + 0.3985 \end{aligned} \tag{2.20} \]
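which comprise $\mathcal{F}(x)$. Note that $f_1(-2.7) = f_2(-2.7) = 0.5$ and $f_2(2.7) = f_3(2.7) = 0.5$, so that the discriminant differentials shown in figure 2.4 (right) are given by

\[ \begin{aligned}
\delta_1(x) &= f_1(x) - \max_{k \neq 1} f_k(x)
 = \begin{cases} f_1(x) - f_2(x), & x < 2.7 \\ f_1(x) - f_3(x), & \text{otherwise} \end{cases}
 = \begin{cases} -0.0376\,x - 0.1015, & x < 2.7 \\ -0.0752\,x, & \text{otherwise} \end{cases} \\[4pt]
\delta_2(x) &= f_2(x) - \max_{k \neq 2} f_k(x)
 = \begin{cases} f_2(x) - f_1(x), & x < 0 \\ f_2(x) - f_3(x), & \text{otherwise} \end{cases}
 = \begin{cases} 0.0376\,x + 0.1015, & x < 0 \\ -0.0376\,x + 0.1015, & \text{otherwise} \end{cases} \\[4pt]
\delta_3(x) &= f_3(x) - \max_{k \neq 3} f_k(x)
 = \begin{cases} f_3(x) - f_2(x), & x > -2.7 \\ f_3(x) - f_1(x), & \text{otherwise} \end{cases}
 = \begin{cases} 0.0376\,x - 0.1015, & x > -2.7 \\ 0.0752\,x, & \text{otherwise} \end{cases}
\end{aligned} \tag{2.21} \]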


Note that the largest discriminant function and, as a result, the positive discriminant differential always correspond to the largest a posteriori class probability of $x$. For this reason, (2.17) is satisfied and $\mathcal{F}(x) = \mathcal{F}(x)_{\text{Bayes-Differential}}$. By comparing the characteristics of $\mathcal{F}(x)$ against definitions 2.3 to 2.6, we find that $\mathcal{F}(x)$ does not satisfy the conditions for any other form of the BDF.
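The piecewise expressions for the discriminant differentials in this example can be checked numerically. The following is a minimal sketch, not part of the original text: it evaluates the three linear discriminant functions of (2.20) and applies the definition in (2.18) directly. The helper names `discriminant_differentials` and `top_class` are illustrative assumptions.

```python
# Discriminant functions of example 2.4, as given in (2.20).
def f1(x):
    return -0.0376 * x + 0.3985

def f2(x):
    return 0.5

def f3(x):
    return 0.0376 * x + 0.3985

FUNCS = (f1, f2, f3)  # hypothetical container for the three functions

def discriminant_differentials(x):
    """Return (delta_1, delta_2, delta_3) at x, following (2.18):
    each function's value minus the largest *other* function's value."""
    y = [f(x) for f in FUNCS]
    return tuple(
        y[i] - max(y[k] for k in range(len(y)) if k != i)
        for i in range(len(y))
    )

def top_class(x):
    """1-based index of the largest discriminant function at x."""
    y = [f(x) for f in FUNCS]
    return y.index(max(y)) + 1

# Away from the boundaries near x = -2.7 and x = 2.7, exactly one
# differential is positive, and it belongs to the top-ranked function.
for x in (-5.0, -1.0, 1.0, 5.0):
    print(x, top_class(x), discriminant_differentials(x))
```

Evaluating at points on either side of the boundaries shows the sign pattern described above: one positive differential for the top-ranked class and negative differentials for the rest.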


2.2.3 Learning Paradigms for the Bayesian Discriminant Function

We use the notation $X^j$ to denote the $j$th example of the random vector $X$; likewise, we use the notation $\mathcal{W}^j$ to denote the class label of that example. Supervised learning is the process by which the $n$ example/empirical class label pairs $\{(X^1, \mathcal{W}^1), \ldots, (X^n, \mathcal{W}^n)\}$ of the training sample⁷ are used to adjust the parameters of the classifier so that its labeling of the training examples matches the actual class labels as closely as possible. As the training sample size increases towards infinity, the empirical a posteriori class probabilities of the

⁷ The training sample is the set of all pairs $\{(X^1, \mathcal{W}^1), \ldots, (X^n, \mathcal{W}^n)\}$ used to train the classifier. The test sample is used to assess the classifier's discrimination, and comprises all pairs not in the training sample.
[Figure 2.4: two panels, each plotted against $x$ with axis marks at -2.7, 0, and 2.7.]

Figure 2.4: Left: The a posteriori class probabilities $P_{\mathcal{W}|x}(\omega_i \mid x)$ of the three-class random variable depicted in figures 2.1 and 2.3, with a differential form of the Bayesian discriminant function $\mathcal{F}(x)_{\text{Bayes-Differential}}$ superimposed. Right: The a posteriori class differentials $\Delta_{\mathcal{W}|x}(\omega_i \mid x)$ of the same three-class random variable, with the discriminant differentials of $\mathcal{F}(x)_{\text{Bayes-Differential}}$ superimposed. Note that where the $i$th discriminant differential $\delta_i(x)$ is positive, $\omega_i$ is the Bayes-optimal class label for $x$.


feature vector converge to their true underlying values. For this reason, a classifier possessing sufficient functional complexity⁸ can learn to approximate the BDF in at least one of the four forms described in the previous section.

There are two fundamental learning strategies for statistical pattern recognition. Classifiers that employ probabilistic learning learn to approximate the probabilistic form of the BDF. That is, they learn to estimate the a posteriori probabilities of the $C$ classes over all of feature vector space, as exemplified by the gray shaded functions in figure 2.4 (left). The estimation is done either directly or by estimating the class-conditional pdfs and class prior probabilities, from which the a posteriori class probabilities can be computed. Classifiers that employ discriminative learning do not learn to estimate probabilities; they merely learn to estimate the identity of the Bayes-optimal class over all of feature vector space. In effect, discriminative learning focuses on partitioning $\mathcal{X}$ along the class boundaries, thereby identifying regions on $\mathcal{X}$ inside which all patterns represent a single class in the Bayes-optimal sense. This learning is done without explicitly estimating the a posteriori probabilities of each class over $\mathcal{X}$; instead it focuses on estimating a

⁸ A formal definition of functional complexity is not essential to this chapter; it is sufficient to state that there is some upper bound on the intricacy of the discriminator $\mathcal{G}(X \mid \theta)$ that a classifier with limited functional complexity can implement.
Paradigm                           Differentiable   Learning Strategy              Parametric?
                                   Supervised
                                   Classifier?

Linear Classifiers
  Rosenblatt's perceptron          yes              discriminative (a, b)          no
  Widrow-Hoff (i.e.,               yes              probabilistic (a)              no
    LMS/MSE-generated) variants
  Ho-Kashyap                       yes              probabilistic/                 no
                                                    discriminative (a, b, c)
  logistic regression              yes              probabilistic (a)              yes

Non-Linear Classifiers
  k nearest neighbors              no               probabilistic                  no
  Parzen windows                   no               probabilistic                  no
  radial basis functions           yes              probabilistic (a)              no
  multi-layer perceptrons          yes              probabilistic (a)              no
  decision trees                   no               discriminative                 no
  LVQ2                             no               discriminative                 no

Table 2.1: Some well-known classification paradigms and the learning strategies they employ. The differentiable supervised classifier is described in definition 2.8.

(a) Can learn differentially.

(b) Guaranteed to be minimum-error only in the case that the training sample is linearly separable.

(c) Fundamentally probabilistic, but discriminative if the training examples are linearly separable.

differential form of the BDF, as exemplified by the dashed-line functions in figure 2.4. By definitions 2.5 and 2.6, this is equivalent to learning $\mathcal{F}(X)_{\text{Bayes-Strictly Differential}}$ to at least one (sign) bit precision over $\mathcal{X}$.

Differential learning is discriminative learning in which the optimal parameters of the classifier are determined by a search on parameter space $\Theta$ aimed at optimizing a differentiable objective function. Three definitions are relevant at this point:

Definition 2.8 The differentiable supervised classifier: This classifier is one that forms an input-to-output mapping by adjusting a set of internal parameters via an iterative search aimed at optimizing a differentiable objective function (or empirical risk measure). The objective function is a metric that evaluates how well the classifier's evolving mapping reflects the empirical relationship between the input patterns of the training sample and their class membership, modeled by the classifier's outputs. Each one of the classifier's discriminant functions $g_i(X \mid \theta)$ must be a differentiable function of its parameters $\theta$.
Definition 2.9 Probabilistic learning $\Lambda_P$: Any classifier that learns a probabilistic form of the Bayesian discriminant function $\mathcal{F}(X)_{\text{Bayes-Probabilistic}}$ (definition 2.4) employs probabilistic learning. We use the notation $\Lambda_P$ to denote probabilistic learning. If the classifier is a differentiable supervised classifier (defined above), it implements probabilistic learning through the use of an error measure objective function (see sections 2.2.4 and 2.3).

Definition 2.10 Differential learning $\Lambda_\Delta$: This is discriminative learning performed by a differentiable supervised classifier (defined above) that employs the classification figure-of-merit (CFM) objective function (see [55], sections 2.2.4 and 2.4, and appendix D). A classifier that employs differential learning learns the differential form of the Bayesian discriminant function $\mathcal{F}(X)_{\text{Bayes-Differential}}$ (definition 2.6). We use the notation $\Lambda_\Delta$ to denote differential learning.

Table 2.1 lists a few well-known classifier paradigms and the learning strategies they employ; it emphasizes that all classifier paradigms can be associated with either the probabilistic or the discriminative learning strategy. Classifiers characterized as "linear" are those that form (piece-wise) linear decision boundaries on $\mathcal{X}$; those characterized as "non-linear" form potentially non-linear decision boundaries on $\mathcal{X}$. Rosenblatt's perceptron, the Widrow-Hoff linear classifier, and the Ho-Kashyap linear classifier all have linear discriminant functions $g_i(X \mid \theta)$. Reference [29] describes each of these three classifiers in detail; they differ only by the manner in which they learn. Specifically, each uses a different objective function to search iteratively for optimal parameters. Thus, they constitute differentiable supervised classifiers. Rosenblatt's perceptron criterion function [116, 29] seeks only to classify $X$ correctly, not to estimate the a posteriori probabilities; as a result, it is a discriminative learning procedure (see appendix E). The Widrow-Hoff and Ho-Kashyap variants both minimize a mean-squared error objective function; the Ho-Kashyap model adds a constraint to the MSE-minimization procedure that guarantees class separation of the training sample if it is indeed linearly separable. As we shall see in section 2.3.2, minimizing an MSE objective function is equivalent to approximating the a posteriori class probabilities of $X$, which is a probabilistic learning strategy.

The logistic regression model replaces the preceding linear discriminant functions with logistic discriminant functions $g_i(X \mid \theta) = \left[ 1 + \exp\left( -\left( \theta_0 + \sum_{j} \theta_j x_j \right) \right) \right]^{-1}$ (e.g., [91, ch. 8], [68]). Although the decision boundaries on $\mathcal{X}$ remain (piece-wise) linear, the logistic discriminant function is a better choice for approximating the a posteriori class probabilities of $X$, particularly when the class-conditional pdfs of $X$ are Gaussian (see appendix F). The logistic regression model learns by the method of maximum likelihood, but the logistic non-linearity makes closed-form computation of the optimal parameters impossible. For this reason, logistic regression takes the form of an iterative learning procedure in which the Kullback-Leibler information distance ([82, 81]; see section 2.3.2) between the discriminant function(s) and the empirical a posteriori class probabilities of $X$ (manifest in the training sample's statistics) is minimized (see appendix F). Since the logistic discriminant function it incorporates is identical to the one employed in multi-layer perceptron (MLP) classifiers [120], the logistic regression model can be viewed as the MLP in its simplest form. Both the logistic regression paradigm and MLPs are probabilistically-generated differentiable supervised classifiers.
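As an illustration of the logistic discriminant function described above, the following minimal sketch evaluates it for a single output. The function name and argument layout are assumptions made for this example; the computation is arranged so that large negative arguments do not overflow.

```python
import math

def logistic_discriminant(x, theta0, theta):
    """Logistic discriminant 1 / (1 + exp(-(theta0 + sum_j theta_j * x_j))).

    x and theta are equal-length sequences; theta0 is the bias term.
    (Hypothetical helper, written for illustration only.)
    """
    a = theta0 + sum(t * xj for t, xj in zip(theta, x))
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # Equivalent form that avoids overflow when a is very negative.
    e = math.exp(a)
    return e / (1.0 + e)

# The output equals 0.5 exactly where theta0 + theta . x = 0, which is
# why the decision boundary on feature space stays linear.
print(logistic_discriminant((1.0, -1.0), 0.0, (2.0, 2.0)))
```

The output crosses 0.5 on the hyperplane where the linear argument is zero, so thresholding it partitions feature space along a linear boundary.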

The k nearest neighbors algorithm estimates the class of a test example as the most likely class among the k nearest training examples. The likelihood of each class is estimated by its relative frequency among the k nearest training examples. As the training sample size grows large, these relative frequencies converge to the true a posteriori class probabilities of $X$, so the k nearest neighbors paradigm learns probabilistically. The lack of an objective function-directed learning procedure disqualifies it as a differentiable supervised classifier.
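The relative-frequency estimate used by k nearest neighbors can be sketched in a few lines. This is an illustrative example only: the Euclidean distance, the data layout (a list of feature-tuple/label pairs), and the function names are assumptions, and k is assumed not to exceed the number of training examples.

```python
import math
from collections import Counter

def knn_posteriors(x, train, k):
    """Relative class frequencies among the k training examples nearest to x.

    train is a list of (feature_tuple, label) pairs; Euclidean distance
    is an assumption made for this sketch.
    """
    nearest = sorted(train, key=lambda ex: math.dist(x, ex[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: c / k for label, c in counts.items()}

def knn_classify(x, train, k):
    """Label with the largest relative frequency among the k neighbors."""
    post = knn_posteriors(x, train, k)
    return max(post, key=post.get)

# Hypothetical one-dimensional training sample.
train = [((0.0,), "a"), ((0.2,), "a"), ((0.4,), "b"),
         ((1.0,), "b"), ((1.2,), "b")]
print(knn_posteriors((0.1,), train, 3))
print(knn_classify((1.1,), train, 3))
```

No parameters are adjusted by an objective-function-directed search here; the training sample itself is the model, which is the property the paragraph above points to.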

Parzen windows attempt to estimate the class-conditional densities of $X$ via a linear superposition of window functions, one for each training example. The specific form of the window function is not particularly important, as long as it is unimodal and has a unit area under its curve. The volume of $\mathcal{X}$ that the window covers is variable. As the training sample size grows large, the linear superposition of window functions for a given class converges to the true class-conditional density of $X$ [29, sec. 4.3], so the paradigm constitutes a form of probabilistic learning. Again, the lack of an objective function-directed learning procedure disqualifies it as a differentiable supervised classifier.
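A one-dimensional sketch of the Parzen-window estimate follows. It is an illustration under stated assumptions rather than the formulation of [29]: the window is taken to be a unit-area Gaussian, the window width `h` is fixed by hand, and class priors are estimated by relative class frequency in the training sample.

```python
import math

def gaussian_window(u):
    """A unimodal window function with unit area (assumed choice)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_density(x, samples, h):
    """Class-conditional density estimate at x: the average of windows
    of width h, one centred on each training example of the class."""
    return sum(gaussian_window((x - s) / h) for s in samples) / (len(samples) * h)

def parzen_classify(x, samples_by_class, h):
    """Pick the class maximizing (estimated prior) * (estimated density)."""
    n = sum(len(s) for s in samples_by_class.values())
    scores = {
        c: (len(s) / n) * parzen_density(x, s, h)
        for c, s in samples_by_class.items()
    }
    return max(scores, key=scores.get)

# Hypothetical two-class sample on the real line.
data = {"a": [-1.0, -0.8, -1.2], "b": [1.0, 1.1, 0.9]}
print(parzen_classify(-0.9, data, 0.5))
```

Because each window has unit area and the superposition is averaged, the estimate itself integrates to one, as a density should.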

Radial basis function neural networks (RBFs) (e.g., [18, 95, 104, 92]) are, like the MLPs mentioned earlier, discriminant functions formed by cascaded layers of non-linear basis functions. In the case of MLPs, the basis function is a logistic one that forms linear decision surfaces; in the case of RBFs, the basis function is most commonly a Gaussian one that forms radial (i.e., hyperelliptical) decision surfaces. Beyond these differences the models are quite similar. Both are differentiable supervised classifiers that typically employ probabilistic learning.

Decision trees are discriminative classifiers that form linear decision surfaces on $\mathcal{X}$ by means of thresholds; these thresholds are expressed as a set of rules associated with each class of $X$; the rules are expressed in disjunctive normal form (DNF)⁹. There are numerous methods by which the rules for dividing $\mathcal{X}$ into class regions are induced (see, for example, [139, ch. 5]). The details of rule induction are not important for our purpose, which is merely to point out that the process is fundamentally discriminative. Because the rules of the DNF are, in effect, step functions on $\mathcal{X}$, they are non-differentiable, and the resulting classifier does not satisfy the requirements for a differentiable supervised classifier.

The  general  family  of  vector  quantization  (VQ)  classifiers  contains  paradigms  that  are  probabilistic  as 
well  as  discriminative.  The  k  nearest  neighbors  paradigm  described  earlier  can  be  viewed  as  a  probabilistic 
VQ  classifier.  Discriminative  variants  also  exist.  The  most  notable  is  the  LVQ2  paradigm  [76],  which 

⁹ See [139, pg. 114] for a simple definition of the DNF.

¹⁰ See [89] for an extensive summary of vector quantization techniques through the early 1980's.
[Figure 2.5: panels labeled CFM (left) and EM (right), each showing the discriminator output state and the target vector.]

Figure 2.5: A diagrammatic comparison of error measure (EM) and classification figure-of-merit (CFM) objective functions (darker outputs have larger values than lighter ones). EMs attempt to match the discriminator output state (left) with a target vector (right); CFM does not. Instead it uses the target vector merely to identify the discriminator output $y_\tau$ corresponding to the class label of the classifier's input example. CFM then seeks to maximize a function of the difference (or discriminant differential) $\delta_\tau$ between this output and the largest other output (in this case, $y_1$). The function $\sigma[\delta, \psi]$ is shown in figure 2.6.

[Figure 2.6: family of sigmoid curves plotted against $\delta$.]

Figure 2.6: A synthetic asymmetric sigmoidal form of the classification figure-of-merit (CFM) [55] shown for discriminant differential values on the interval $-1 \leq \delta \leq 1$. The shape of the sigmoid is controlled by a confidence measure $\psi$ on $(0, 1]$: $\sigma[\delta, \psi]$ is shown for eight different values of $\psi$. The differentiable function of $\delta$ is nearly linear for a confidence measure of unity. In the limit that $\psi$ is zero, the function becomes a Heaviside step. The synthetic function and its first derivative are easily computed (see appendix D).


associates  a  number  of  prototypical  vectors  with  each  class.  The  vectors  are  initially  determined  by  k-means 
clustering  (e.g.,  see  [89]).  Learning  is  then  performed  by  iteratively  perturbing  the  vector  locations  on  X 
with  the  goal  of  minimizing  the  number  of  misclassified  training  examples.  The  learning  strategy  is  therefore 
discriminative;  the  resulting  classifier  does  not  satisfy  the  requirements  for  a  differentiable  supervised 
classifier. 

Table 2.1 lists only a few classifiers. Countless others exist, but each is either a differentiable supervised classifier or it is not. The remainder of this chapter, and of this text, deals with those that are, since all such classifiers can employ both probabilistic and differential learning.

2.2.4 The Link Between Objective Function and Learning Strategy

In section 2.3 we prove that differentiable supervised classifiers generated with a broad family of error measures (EMs) learn probabilistically; that is, they learn a probabilistic form of the BDF described by definitions 2.3 and 2.4. In section 2.4 we prove that these same classifiers learn differentially when generated with the CFM objective function [55]; that is, they learn a differential form of the BDF described by definition 2.6.

Figure 2.5 compares error measures to the CFM objective function diagrammatically. Error measures such as mean-squared error (MSE) and the Kullback-Leibler information distance [82, 81], known as "cross entropy" (CE) in the neural network literature (e.g., [64]), compare the classifier's discriminator output state with a target vector $T$. Given a training example/empirical class label pair $(X^j, \mathcal{W}^j = \omega_\tau)$, the $\tau$th element of $T$ ($T_\tau$) is typically unity, indicating that the empirical class label of the example is $\omega_\tau$. All the other elements of $T$ are typically zero.¹¹ The right-hand side of figure 2.5 illustrates how the classifier learns via an error measure; it alters its parameters in order to match the discriminator's output state with the training example's target vector. This is done by minimizing the error measure (EM) between these two vectors. The arrows superimposed on the gray shading between the discriminator output state and the target vector symbolize the process, which is iteratively repeated for all examples in the training sample until the average error measure converges to a small value.

Unlike its EM counterparts, the CFM objective function has no target values; this is because it is not an error measure. The left-hand side of figure 2.5 illustrates how the classifier learns via CFM. It alters its discriminator's parameters in order to maximize the discriminant differential $\delta_\tau$ between 1) the output $y_\tau$ corresponding to the class label $\omega_\tau$ of the example, and 2) the largest other output $y_{\bar{\tau}}$ (note that $y_{\bar{\tau}} = y_1$ in the figure):¹²

\[ \delta_\tau \triangleq y_\tau - y_{\bar{\tau}}, \qquad \mathcal{W}^j = \omega_\tau, \quad y_{\bar{\tau}} = \max_{k \neq \tau} y_k \tag{2.22} \]


Notational convention for the discriminant differential: We generally omit the subscript $\tau$ when referring to the discriminant differential. Absent a subscript, the notation $\delta$ always implies $\delta_\tau$.


The single curved arrow superimposed on the gray shading to the left of the discriminator output state in figure 2.5 symbolizes the computation of $\delta_\tau$, which is maximized by maximizing the measure $\sigma[\delta_\tau, \psi]$. The maximization is iteratively repeated for all examples in the training sample until the average CFM converges to a large value. Note that the target vector $T$ is used only to identify $y_\tau$.
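The computation of the discriminant differential in (2.22) can be sketched as follows. The sigmoid shown is an assumption introduced purely for illustration: it is an ordinary logistic function with slope proportional to $1/\psi$, standing in for, and not equal to, the synthetic CFM function of appendix D. The names `discriminant_differential` and `stand_in_cfm` are hypothetical.

```python
import math

def discriminant_differential(outputs, target_index):
    """delta = y_tau - (largest other output), as in (2.22).

    target_index plays the role of tau; it is obtained from the class
    label alone, which is the only use made of the target vector.
    """
    y_tau = outputs[target_index]
    y_other = max(y for k, y in enumerate(outputs) if k != target_index)
    return y_tau - y_other

def stand_in_cfm(delta, psi):
    """Bounded, non-decreasing sigmoid of delta whose maximum slope is
    proportional to 1/psi. A plain logistic stand-in (an assumption),
    NOT the synthetic function of appendix D."""
    z = delta / psi
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # avoids overflow for very negative z
    return e / (1.0 + e)

# Correctly ranked example: positive differential.
print(discriminant_differential([0.2, 0.7, 0.6], 1))
# Misranked example: negative differential.
print(discriminant_differential([0.9, 0.3, 0.1], 2))
# Small psi pushes the stand-in towards a step function of delta.
print(stand_in_cfm(0.5, 0.001), stand_in_cfm(-0.5, 0.001))
```

A positive differential means the example is classified correctly, so maximizing a non-decreasing function of it pushes the correct output above every competitor without prescribing target values for any output.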

Definition 2.11 The CFM objective function¹³: The CFM objective function for a given example is a strictly increasing function $\sigma$ of the discriminant differential $\delta$ corresponding to the empirical class label of the training example. The function must have a sigmoidal form that spans the continuum between a linear function of $\delta$ and a step function of $\delta$. The maximum steepness of the sigmoid is regulated by the
¹¹ $T$ need not be binary. The following proofs allow for non-binary target vectors.

¹² It is important to note that the identity of the largest other output $y_{\bar{\tau}}$ in $\delta_\tau$ is stochastic: it not only varies across examples, it may also change for a given example as learning progresses [55].

¹³ Throughout this text we refer to the form of CFM that involves the computation and use of one and only one discriminant differential. This is the form of CFM originally described as "N-monotonic CFM": see [55, pg. 226].
confidence parameter $\psi$. The specific functional form of $\sigma[\,\cdot\,]$ is not important, as long as it satisfies the following sigmoidal constraints:

• The function must have finite lower and upper bounds $l$ and $h$:

\[ -\infty \ll l \leq \sigma[\delta, \psi] \leq h \ll \infty \tag{2.23} \]

• The function must be a strictly non-decreasing sigmoidal function of $\delta$:

\[ \frac{\partial}{\partial \delta}\, \sigma[\delta, \psi] \;\; \begin{cases} > 0, & \text{for small } |\delta| \\ \geq 0, & \text{otherwise} \end{cases} \tag{2.24} \]

• The function must have a maximum slope occurring in its transition region. This transition slope must be inversely proportional to the confidence parameter $\psi$:

\[ \max_{\delta} \frac{\partial}{\partial \delta}\, \sigma[\delta, \psi] \propto \psi^{-1}, \quad \psi \in (0, 1] \tag{2.25} \]

This proportionality requirement ensures that learning is reasonably fast (see section D.3).

Remark: We use an asymmetric sigmoidal function for $\sigma[\,\cdot\,]$, which satisfies the constraints above as well as others imposed by theorem 2.2 of section 2.4 and chapter 5; it has lower and upper bounds of $l = 0$ and $h = 1$. This function is illustrated in figure 2.6; it is expressed by a computationally efficient mathematical form (see appendix D). The original sigmoidal form of the CFM objective function is given in [55]¹⁴, but the synthetic form described herein has a number of advantages relating to its computational efficiency and the differential learning rates it engenders (see appendix D and chapter 5). Figure 2.6 illustrates the synthetic function on the interval $-1 \leq \delta \leq 1$ for eight different confidence values $\psi$. Note that this synthetic function is approximately linear in $\delta$ for $\psi = 1$ and it is a step function of $\delta$ for $\psi \to 0^{+}$:

\[ \lim_{\psi \to 0^{+}} \sigma[\delta, \psi] = \begin{cases} 1, & \delta > 0 \\ 0, & \text{otherwise} \end{cases} \tag{2.26} \]

The parameter $\psi$ is described further in section 2.4 and appendix D.

¹⁴ In this reference the parameter $\beta$ is proportional to $1/\psi$.

If the number of classes $C$ is two, the discriminator $\mathcal{G}(X \mid \theta)$ is linear, and the classifier learns differentially via the CFM objective function, the resulting paradigm is quite similar to Rosenblatt's perceptron [116]. Indeed, appendix L shows that one can view differential learning via CFM as a generalization of the two-class perceptron approach to discriminative learning. The generalization is such that

• there is no restriction on the number of classes $C$,

• there is no restriction on the functional form of the discriminator (except that $\mathcal{G}(X \mid \theta)$ be differentiable on parameter space $\Theta$),

• the learning procedure involves an iterative search on $\Theta$ aimed at maximizing a differentiable CFM objective function $\sigma[\,\cdot\,]$.


2.3 Probabilistic Learning $\Lambda_P$


The differentiable supervised classifier learns probabilistically by adjusting its parameters to minimize an error measure over the training sample. We assume that a number of favorable conditions exist prior to learning, in order to be sure that the classifier learns a probabilistic form of the BDF $\mathcal{F}(X)_{\text{Bayes-Probabilistic}} \in \mathbb{F}_{\text{Bayes-Probabilistic}}$ (see definitions 2.3 and 2.4):


• We assume that we have access to an unlimited number of training examples, so that we have sufficient data to learn precisely.

• We assume that the discriminator $\mathcal{G}(X \mid \theta)$ has sufficient functional complexity¹⁶ to learn $\mathcal{F}(X)_{\text{Bayes-Probabilistic}}$ precisely. Specifically, we assume that the classifier's parameter space $\Theta$ contains at least one point $\theta^{*}$ that both minimizes the error measure (EM) over the training set and satisfies the constraint $\mathcal{G}(X \mid \theta^{*}) \in \mathbb{F}_{\text{Bayes-Probabilistic}}$.

• We assume that the algorithm we use to search for the parameters $\theta^{*}$ is guaranteed to find $\theta^{*}$, given sufficient time and computational resources.


In short, we assume that $\mathcal{F}(X)_{\text{Bayes-Probabilistic}}$ is learnable to the extent that we have sufficient (possibly infinite) information, computational, and temporal resources to learn it. We are not yet concerned with the efficiency of the learning procedure; we are merely concerned that it does the right thing, given enough resources. We address the issue of learnability in more realistic terms in the chapters that follow.


¹⁵ Barak Pearlmutter has made important contributions to the material in this section. We note in particular his original formulation of 1) the general error measure (the present formulation is a minor extension), and 2) the strictly probabilistic constraint of (2.45).

¹⁶ Again, we eschew a formal definition of functional complexity in this chapter, stating merely that there is some upper bound on the intricacy of the discriminant functions that a classifier with limited functional complexity can implement. We assume that the classifier has sufficient complexity to learn the probabilistic form of the BDF. While the proofs of this chapter are not limited to neural network classifiers per se, we note the proofs that feed-forward neural networks can learn arbitrarily complex mappings [24, 143].
2.3.1 The General Error Measure

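Given a training example/class label pair $(X^j, \mathcal{W}^j)$, the target vector $T$ for the discriminator outputs $Y$ has elements that can assume one of two states. The target vector element corresponding to the empirical class of the training example is set to the high state $D$, and all the other elements are set to the low state $\neg D$ (read, "not D"):

\[ \begin{aligned}
T &= \{T_1, \ldots, T_C\}; \quad T \in \{\neg D, D\}^{C} \\
T_i &= \begin{cases} D, & \mathcal{W}^j = \omega_i \\ \neg D, & \text{otherwise} \end{cases} \\
D &\in \mathbb{R}; \quad \neg D \in \mathbb{R}; \quad \neg D < D
\end{aligned} \tag{2.27} \]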


We require a particular kind of symmetry in the general error measure $\xi[\,\cdot\,]$: a discriminator output $y_i$ that is higher than its low-state target $\neg D$ by the amount $\epsilon$ must generate the same error as a discriminator output $y_k$ that is lower than its high-state target $D$ by $\epsilon$. This symmetry constraint reinforces our intuitive notion that the error measure should penalize all missed targets in a consistent manner. Mathematically,

\[ \xi[\, y_i = \neg D + \epsilon,\; T_i = \neg D \,] = \xi[\, y_k = D - \epsilon,\; T_k = D \,] \quad \forall \epsilon \in \mathbb{R} \tag{2.28} \]

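The general measure of error between the $i$th discriminator output and its target is therefore given by

\[ \xi\big[\, g_i(X^j \mid \theta),\; T_i((X^j, \mathcal{W}^j)) \,\big]
 = \begin{cases} f\big( T_i((X^j, \mathcal{W}^j)) - g_i(X^j \mid \theta) \big), & T_i((X^j, \mathcal{W}^j)) = D \\ f\big( g_i(X^j \mid \theta) - T_i((X^j, \mathcal{W}^j)) \big), & \text{otherwise} \end{cases}
 = \begin{cases} f\big( D - g_i(X^j \mid \theta) \big), & \mathcal{W}^j = \omega_i \\ f\big( g_i(X^j \mid \theta) - \neg D \big), & \text{otherwise} \end{cases} \tag{2.29} \]

The function $f(\,\cdot\,)$ in (2.29) is positive definite, with a unique minimum occurring when its argument is zero: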

/(«  =  0)  <  /(«  #  0)  &  ^/(m)  >0  (2.30) 

Equations (2.28) to (2.30) are minor variants of the expressions in [54, sec. 3.1]. Miller, Goodman, and Smyth have derived similar expressions independently in [93].
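The structure of the general error measure in (2.27) to (2.29) can be made concrete with a short sketch. The choices below are assumptions for illustration: $f(u) = u^2$ (which yields a squared-error measure), $D = 1$, $\neg D = 0$, and class labels represented as zero-based indices. The function names are hypothetical.

```python
def squared(u):
    """One admissible choice of f: minimum of zero at u = 0, positive elsewhere."""
    return u * u

def general_error(y, is_target_class, f, D=1.0, not_D=0.0):
    """Error between one discriminator output y and its target, per (2.29):
    f(D - y) for the output of the labeled class, f(y - not_D) otherwise."""
    if is_target_class:
        return f(D - y)
    return f(y - not_D)

def example_error(outputs, label_index, f, D=1.0, not_D=0.0):
    """Sum of the C per-output error terms for a single example."""
    return sum(
        general_error(y, i == label_index, f, D, not_D)
        for i, y in enumerate(outputs)
    )

# Symmetry of (2.28): overshooting the low target by eps costs the same
# as undershooting the high target by eps.
eps = 0.3
print(general_error(0.0 + eps, False, squared), general_error(1.0 - eps, True, squared))
print(example_error([0.8, 0.1, 0.3], 0, squared))
```

Because both branches pass the size of the miss through the same function $f$, the symmetry constraint holds for any admissible $f$, not only for the squared error used here.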
The total error generated by a training sample $S^n$ is the sum of the $C$ error terms in (2.29), over all discriminator outputs, for each example:

\[ \mathrm{EM}(S^n \mid \theta) = \sum_{i=1}^{C} \underbrace{\frac{1}{n} \sum_{j=1}^{n} \xi\big[\, g_i(X^j \mid \theta),\; T_i((X^j, \mathcal{W}^j)) \,\big]}_{e_i} \tag{2.31} \]

Up to this point we have used the notation $(X^j, \mathcal{W}^j)$ to denote an example $X^j$ of $X$ and its associated class label $\mathcal{W}^j$. Now we introduce the notation $X_p^j$ to denote an example of $X$ having the specific value $X_p$: a unique pattern or prototype (terms that we use synonymously) of $X$ (i.e., a particular point on $\mathcal{X}$ identified by the subscript $p$). No two prototypes represent the same point on $\mathcal{X}$ ($X_a = X_b$ iff $a = b$), and there is no restriction on the number of prototypes.¹⁷ We denote the class label associated with $X_p^j$ by $\mathcal{W}_p^j$, and we denote the resulting example/class label pair by $(X_p^j, \mathcal{W}_p^j)$. Using this notation, we can re-state $e_i$ in (2.31) as

\[ e_i = \frac{1}{n} \sum_{p=1}^{P} \sum_{j=1}^{n_p} \xi\big[\, g_i(X_p^j \mid \theta),\; T_i((X_p^j, \mathcal{W}_p^j)) \,\big] \tag{2.32} \]

where  P  denotes  the  total  number  of  unique  patterns  and  np  denotes  the  number  of  examples  of  the  pattern 
Xp  among  the  n  training  examples.  Thus,  J^p-f  ttp  =  n.  If  we  use  np^i  to  denote  the  number  of  examples 

of  the  pattern  Xp  having  the  class  label  U!/,  such  that  np,f  =  ttp,  (2.29)  can  be  used  to  simplify 
(2.32),  replacing  the  notion  of  examples  with  the  more  general  notion  of  prototypes: 


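\[
\frac{1}{n} \sum_{j=1}^{n} \varepsilon_i\bigl[g_i(X^j \mid \theta),\, T_i((X^j, W^j))\bigr]
= \frac{1}{n} \sum_{p=1}^{P} \bigl[ n_{p,i} \cdot f(D - g_i(X_p \mid \theta)) + (n_p - n_{p,i}) \cdot f(g_i(X_p \mid \theta) - \neg D) \bigr] \tag{2.33}
\]

Thus

\[
\mathrm{EM}(S^n \mid \theta) = \sum_{i=1}^{C} \sum_{p=1}^{P} \frac{n_p}{n} \left[ \frac{n_{p,i}}{n_p} \cdot f(D - g_i(X_p \mid \theta)) + \frac{n_p - n_{p,i}}{n_p} \cdot f(g_i(X_p \mid \theta) - \neg D) \right] \tag{2.34}
\]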
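The regrouping from examples to prototypes can be checked numerically. The following Python sketch is not from the report; it assumes binary targets ($D = 1$, $\neg D = 0$), the squared-error member $f(u) = u^2/2$, and a $1/n$ normalization of the empirical error measure. The sample and the outputs stored in `g` are hypothetical.

```python
from collections import Counter

D, NOT_D = 1.0, 0.0          # assumed binary targets
C = 2                        # number of classes


def f(u):
    """Assumed error function f(u) = u^2 / 2 (the MSE member of the family)."""
    return 0.5 * u * u


# Hypothetical discriminator outputs g_i(X_p) for two prototypes "a" and "b".
g = {"a": (0.7, 0.3), "b": (0.2, 0.8)}

# Hypothetical training sample: (prototype, class label) pairs.
sample = [("a", 0), ("a", 0), ("a", 1), ("b", 1), ("b", 1), ("b", 0), ("b", 1)]
n = len(sample)

# Per-example form: average over examples, summed over the C outputs.
em_examples = sum(
    f(D - g[x][i]) if w == i else f(g[x][i] - NOT_D)
    for x, w in sample
    for i in range(C)
) / n

# Prototype form: weight each prototype by its empirical frequencies.
n_p = Counter(x for x, _ in sample)    # examples per prototype
n_pi = Counter(sample)                 # examples per (prototype, label) pair
em_prototypes = sum(
    (n_p[x] / n) * (
        (n_pi[(x, i)] / n_p[x]) * f(D - g[x][i])
        + ((n_p[x] - n_pi[(x, i)]) / n_p[x]) * f(g[x][i] - NOT_D)
    )
    for x in n_p
    for i in range(C)
)

print(em_examples, em_prototypes)
```

Both quantities agree up to floating-point rounding, which is all the prototype notation claims: it is a bookkeeping change, not a new error measure.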


As the training sample size $n$ grows asymptotically large, the empirical frequencies converge to their underlying probabilities.^18 Thus,

\[
\lim_{n \to \infty} \mathrm{EM}(S^n \mid \theta) = \sum_{i=1}^{C} \sum_{p=1}^{P} P_X(X_p) \bigl[ P_{W|X}(\omega_i \mid X_p) \cdot f(D - g_i(X_p \mid \theta)) + P_{W|X}(\neg\omega_i \mid X_p) \cdot f(g_i(X_p \mid \theta) - \neg D) \bigr] \tag{2.35}
\]

where

\[
P_{W|X}(\neg\omega_i \mid X_p) = 1 - P_{W|X}(\omega_i \mid X_p) \tag{2.36}
\]

^17 Indeed, we assume the number of prototypes to be infinite for the uncountable feature vector space.

^18 See appendix B. Note also that we are not yet concerned with the specific rate at which this convergence takes place.


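Simultaneously, the number of patterns $P$ grows asymptotically large, such that $\mathrm{EM}(S^n \mid \theta)$ converges to the expected value of the error measure over $\mathcal{X}$, which we denote by $E_X[\mathrm{EM}(X \mid \theta)]$:

\[
\begin{aligned}
\lim_{\substack{n \to \infty \\ P \to \infty}} \mathrm{EM}(S^n \mid \theta) &= E_X[\mathrm{EM}(X \mid \theta)] \\
&= \sum_{i=1}^{C} \underbrace{\int_{\mathcal{X}} \bigl[ f(D - g_i(X \mid \theta)) \cdot P_{W|X}(\omega_i \mid X) + f(g_i(X \mid \theta) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr] \rho_X(X)\, dX}_{E_X[\varepsilon_i(X)]}
\end{aligned}
\tag{2.37}
\]

\[
= \int_{\mathcal{X}} \underbrace{\sum_{i=1}^{C} \bigl[ f(D - g_i(X \mid \theta)) \cdot P_{W|X}(\omega_i \mid X) + f(g_i(X \mid \theta) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr]}_{\mathrm{EM}(X \mid \theta)} \rho_X(X)\, dX \tag{2.38}
\]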


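To minimize $E_X[\mathrm{EM}(X \mid \theta)]$ with respect to the parameters $\theta$ of the discriminator $\mathcal{G}(X \mid \theta)$, we solve for $\theta^*$ such that

\[
\nabla_\theta \bigl( E_X[\mathrm{EM}(X \mid \theta^*)] \bigr) = E_X\bigl[ \nabla_\theta ( \mathrm{EM}(X \mid \theta^*) ) \bigr] = \mathbf{0}, \tag{2.39}
\]

where $\nabla_\theta(\mathrm{EM}(X \mid \theta^*))$ denotes the gradient of $\mathrm{EM}(X \mid \theta)$ with respect to $\theta$, evaluated at $\theta^*$, and $\mathbf{0}$ indicates the vector with zero magnitude. In order for (2.39) to hold for any and all $\rho_X(X)$ on $\mathcal{X}$ (as a trivial example, $\rho_X(X) = \delta(X - X_0)$, where $\delta(\cdot)$ denotes the Dirac delta function [80, pg. 266]), it is necessary that the error measure's gradient $\nabla_\theta(\mathrm{EM}(X \mid \theta^*))$ be zero for all values of $X$:

\[
\nabla_\theta \underbrace{\sum_{i=1}^{C} \bigl[ f(D - g_i(X \mid \theta^*)) \cdot P_{W|X}(\omega_i \mid X) + f(g_i(X \mid \theta^*) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr]}_{\mathrm{EM}(X \mid \theta^*)} = \mathbf{0} \quad \forall X \in \mathcal{X} \tag{2.40}
\]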
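This, in turn, requires that

\[
\frac{\partial \varepsilon_i(X)}{\partial \theta_i} = \frac{\partial \varepsilon_i(X)}{\partial g_i(X \mid \theta^*)} \cdot \frac{\partial g_i(X \mid \theta^*)}{\partial \theta_i} = 0 \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.41}
\]

Clearly, (2.41) is satisfied if the $i$th discriminant function's derivative with respect to the parameter $\theta_i$ is zero, but we are interested in the more general case for which this derivative is non-zero. Indeed, manipulating the parameters that affect the discriminator's functional mappings (i.e., those for which $\frac{\partial}{\partial \theta} g_i(X \mid \theta) \neq 0$) is the whole point of learning. Thus, if the error in (2.37) is to be minimized independent of the values $\{\frac{\partial}{\partial \theta} g_1(X \mid \theta), \ldots, \frac{\partial}{\partial \theta} g_C(X \mid \theta)\}$, it is necessary that

\[
\frac{\partial \varepsilon_i(X)}{\partial g_i(X \mid \theta^*)} = -f'(D - g_i(X \mid \theta^*)) \cdot P_{W|X}(\omega_i \mid X) + f'(g_i(X \mid \theta^*) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) = 0 \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.42}
\]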


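(where $f'(z)$ denotes $\frac{d}{dz} f(z)$). Rearranging terms,

\[
f'(D - g_i(X \mid \theta^*)) \cdot P_{W|X}(\omega_i \mid X) = f'(g_i(X \mid \theta^*) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.43}
\]

\[
\frac{f'(D - g_i(X \mid \theta^*))}{f'(g_i(X \mid \theta^*) - \neg D)} = \frac{1 - P_{W|X}(\omega_i \mid X)}{P_{W|X}(\omega_i \mid X)} \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.44}
\]

If we add the strictly probabilistic constraint on the functional form of the error measure $f(\cdot)$

\[
f'(D - g_i(X \mid \theta)) = f'(g_i(X \mid \theta) - \neg D) \cdot \frac{1 - g_i(X \mid \theta)}{g_i(X \mid \theta)} \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.45}
\]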




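then (2.42) becomes

\[
\begin{aligned}
&-f'(g_i(X \mid \theta^*) - \neg D) \cdot \frac{1 - g_i(X \mid \theta^*)}{g_i(X \mid \theta^*)} \cdot P_{W|X}(\omega_i \mid X) + f'(g_i(X \mid \theta^*) - \neg D) \cdot (1 - P_{W|X}(\omega_i \mid X)) \\
&\qquad = f'(g_i(X \mid \theta^*) - \neg D) \cdot \left[ 1 - \frac{P_{W|X}(\omega_i \mid X)}{g_i(X \mid \theta^*)} \right] = 0 \quad \forall i,\ \forall X \in \mathcal{X}
\end{aligned}
\tag{2.46}
\]

Equation (2.46) requires that

\[
g_i(X \mid \theta^*) = P_{W|X}(\omega_i \mid X) \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.47}
\]

Thus, the differentiable supervised classifier learns $\mathcal{F}(X)_{\text{Bayes-Strictly Probabilistic}}$ when (given our favorable assumptions) it is generated with an error measure satisfying (2.28)–(2.30) and the strictly probabilistic constraint of (2.45).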


2.3.2 Specific Strictly Probabilistic Error Measures

One family of error measures satisfying the strictly probabilistic constraint of (2.45) has the following functional forms:

\[
f(u) = \int u^{r} (\neg u)^{r-1}\, du \tag{2.48}
\]

\[
f'(u) = u^{r} (\neg u)^{r-1} \tag{2.49}
\]

where $r$ is a non-negative integer, and we employ the relationship $y_i = g_i(X \mid \theta)$ to simplify our notation. Using (2.49) and the relationships

\[
u = y_i - \neg D,\quad \neg u = D - y_i \quad \text{s.t.} \quad d(y_i - \neg D) = dy_i \tag{2.50}
\]

\[
u = D - y_i,\quad \neg u = y_i - \neg D \quad \text{s.t.} \quad d(D - y_i) = -dy_i, \tag{2.51}
\]

we can expand (2.48) via two simple transformations of variables:

\[
\begin{aligned}
f(y_i - \neg D) &= \int (y_i - \neg D)^{r} (D - y_i)^{r-1}\, dy_i \\
f(D - y_i) &= -\int (D - y_i)^{r} (y_i - \neg D)^{r-1}\, dy_i \\
\therefore\quad f'(y_i - \neg D) &= (y_i - \neg D)^{r} (D - y_i)^{r-1} \\
f'(D - y_i) &= (D - y_i)^{r} (y_i - \neg D)^{r-1}
\end{aligned}
\tag{2.52}
\]


Substituting (2.52) into (2.44), we find that minimizing this family of error measures leads to the following relationship between the discriminator outputs and their corresponding a posteriori class probabilities:

\[
\frac{D - y_i}{y_i - \neg D} = \frac{1 - P_{W|X}(\omega_i \mid X)}{P_{W|X}(\omega_i \mid X)} \tag{2.53}
\]

\[
\therefore\quad y_i = P_{W|X}(\omega_i \mid X) \cdot (D - \neg D) + \neg D \tag{2.54}
\]

Note that if the low and high-state target values are binary, the discriminator outputs equal their corresponding a posteriori class probabilities:

\[
y_i = P_{W|X}(\omega_i \mid X); \quad \neg D = 0,\ D = 1 \tag{2.55}
\]

However, even if the target values are not binary, the discriminator outputs remain linear functions of their corresponding a posteriori probabilities. Since differentiation is a linear operation, any linear combination of $K$ functions $f_k(u)$ satisfying (2.28)–(2.30), and (2.45)

\[
\phi(u) = \sum_{k=1}^{K} c_k f_k(u) \tag{2.56}
\]

will constitute a viable error measure. The linear coefficients $c_k$ must have values that enforce the constraints of (2.30) on $\phi(u)$, such that

\[
\phi(u = 0) < \phi(u \neq 0) \quad \& \quad \frac{d^2}{du^2} \phi(u) > 0 \tag{2.57}
\]

The family of error measures described by (2.48) and (2.49) has two familiar members.


(r = 0) Kullback-Leibler information distance: When $r = 0$ in (2.48),

\[
f(u) = \int (\neg u)^{-1}\, du = -\log(\neg u) \quad \text{s.t.} \quad \varepsilon[y_i, T_i] =
\begin{cases}
-\log(D - y_i), & T_i = \neg D \\
-\log(y_i - \neg D), & T_i = D
\end{cases}
\tag{2.58}
\]

When the low and high-state target values are binary, the error measure is the Kullback-Leibler information distance [82, 81], known as "cross entropy" (CE) in the connectionist literature (e.g., [64]):

\[
\varepsilon[y_i, T_i] =
\begin{cases}
-\log(1 - y_i), & T_i = \neg D = 0 \\
-\log(y_i), & T_i = D = 1
\end{cases}
\tag{2.59}
\]

Note that the discriminator's output $Y$ must be bounded in order for $f(\cdot)$ in (2.58) to both exist and meet the constraint imposed by (2.30):

\[
Y \in \mathcal{Y} = [l, h]^C; \quad l \geq \neg D,\ h \leq D,\ l < h \tag{2.60}
\]

If $l < \neg D \,\cup\, h > D$ above, then $\neg u$ can be negative, and $-\log(\neg u)$ will be undefined. If $l > \neg D$, $h < D$ above, the constraints imposed by (2.30) will be violated, strictly speaking. However, for practical purposes, setting low and high-state target values that are beyond the limits of the discriminator outputs when $f(\cdot)$ is given by (2.58) is equivalent to setting $\neg D = l$ and $D = h$. Thus, the Kullback-Leibler information distance dictates discriminator outputs on any subset of the output space $\mathcal{Y} \in [0, 1]^C$. With these constraints, the Kullback-Leibler information distance-generated classifier learns $\mathcal{F}(X)_{\text{Bayes-Strictly Probabilistic}}$ by the following proof:


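\[
\mathrm{CE}(S^n \mid \theta) = -\sum_{i=1}^{C} \sum_{p=1}^{P} \frac{n_p}{n} \left[ \frac{n_{p,i}}{n_p} \cdot \log(g_i(X_p \mid \theta)) + \frac{n_p - n_{p,i}}{n_p} \cdot \log(1 - g_i(X_p \mid \theta)) \right] \tag{2.61}
\]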


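Following the derivation for the general error measure in (2.35)–(2.38), we obtain the expected value of the Kullback-Leibler information distance (cross entropy, CE), which we denote by $E_X[\mathrm{CE}(X \mid \theta)]$:

\[
\begin{aligned}
\lim_{\substack{n \to \infty \\ P \to \infty}} \mathrm{CE}(S^n \mid \theta) &= E_X[\mathrm{CE}(X \mid \theta)] \\
&= -\sum_{i=1}^{C} \underbrace{\int_{\mathcal{X}} \bigl[ \log(g_i(X \mid \theta)) \cdot P_{W|X}(\omega_i \mid X) + \log(1 - g_i(X \mid \theta)) \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr] \rho_X(X)\, dX}_{E_X[\varepsilon_i(X)]}
\end{aligned}
\tag{2.62}
\]

\[
= \int_{\mathcal{X}} \underbrace{-\sum_{i=1}^{C} \bigl[ \log(g_i(X \mid \theta)) \cdot P_{W|X}(\omega_i \mid X) + \log(1 - g_i(X \mid \theta)) \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr]}_{\mathrm{CE}(X \mid \theta)} \rho_X(X)\, dX \tag{2.63}
\]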


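To minimize $E_X[\mathrm{CE}(X \mid \theta)]$ with respect to the parameters $\theta$ of the discriminator $\mathcal{G}(X \mid \theta)$, we solve for $\theta^*$ such that $\nabla_\theta(E_X[\mathrm{CE}(X \mid \theta^*)]) = E_X[\nabla_\theta(\mathrm{CE}(X \mid \theta^*))] = \mathbf{0}$. Following the litany of (2.39)–(2.42), we obtain the necessary condition for minimizing $E_X[\mathrm{CE}(X \mid \theta)]$:

\[
\frac{\partial \varepsilon_i(X)}{\partial g_i(X \mid \theta^*)} = -\left[ \frac{1}{g_i(X \mid \theta^*)} \cdot P_{W|X}(\omega_i \mid X) - \frac{1}{1 - g_i(X \mid \theta^*)} \cdot (1 - P_{W|X}(\omega_i \mid X)) \right] = 0 \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.64}
\]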


In order for this equation to hold for any and all $\rho_X(X)$ on $\mathcal{X}$, it is necessary that

\[
\frac{P_{W|X}(\omega_i \mid X)}{g_i(X \mid \theta^*)} = \frac{1 - P_{W|X}(\omega_i \mid X)}{1 - g_i(X \mid \theta^*)} \tag{2.65}
\]

or

\[
g_i(X \mid \theta^*) = P_{W|X}(\omega_i \mid X) \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.66}
\]

Again, if the low and high-state target values for the Kullback-Leibler information distance are not binary but meet the constraints of (2.60), the discriminator outputs will be a linear function of their corresponding a posteriori probabilities: $g_i(X \mid \theta) = P_{W|X}(\omega_i \mid X) \cdot (D - \neg D) + \neg D$.
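The stationarity condition for cross entropy can be illustrated with a brute-force search. The sketch below is not part of the report; it assumes binary targets and a single output, and the helper names `expected_ce` and `argmin_on_grid` are hypothetical. For a fixed a posteriori probability $p$, it evaluates the pointwise expected cross entropy $-[p \log g + (1 - p)\log(1 - g)]$ on a grid and reports the minimizing output $g$.

```python
import math


def expected_ce(g, p):
    """Pointwise expected cross entropy for one output with binary targets."""
    return -(p * math.log(g) + (1.0 - p) * math.log(1.0 - g))


def argmin_on_grid(p, steps=999):
    """Brute-force search over g in (0, 1), excluding the endpoints."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    return min(grid, key=lambda g: expected_ce(g, p))


for p in (0.1, 0.5, 0.85):
    print(p, argmin_on_grid(p))
```

Up to the grid resolution, the minimizing output coincides with $p$, in line with the closed-form result for the binary-target case.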


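(r = 1) Mean squared error: When $r = 1$ in (2.48),

\[
f(u) = \int u\, du = \tfrac{1}{2} u^2 \quad \text{s.t.} \quad \varepsilon[y_i, T_i] =
\begin{cases}
\tfrac{1}{2}(y_i - \neg D)^2, & T_i = \neg D \\
\tfrac{1}{2}(D - y_i)^2, & T_i = D
\end{cases}
\tag{2.67}
\]

the error measure is the mean-squared error (MSE) objective function:

\[
\varepsilon[y_i, T_i] = \tfrac{1}{2}(y_i - T_i)^2 \tag{2.68}
\]

MSE has the particularly nice property that it satisfies the constraints of (2.28)–(2.30), and (2.45) regardless of the nature of $D$, $\neg D$, and $\mathcal{Y}$.

With binary target values, the MSE-generated classifier learns $\mathcal{F}(X)_{\text{Bayes-Strictly Probabilistic}}$ by the following proof:

\[
\mathrm{MSE}(S^n \mid \theta) = \frac{1}{2} \sum_{i=1}^{C} \sum_{p=1}^{P} \frac{n_p}{n} \left[ \frac{n_{p,i}}{n_p} \cdot (g_i(X_p \mid \theta) - 1)^2 + \frac{n_p - n_{p,i}}{n_p} \cdot g_i(X_p \mid \theta)^2 \right] \tag{2.69}
\]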

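Following the derivation for the general error measure in (2.35)–(2.38), we obtain the expected value of the mean-squared error, which we denote by $E_X[\mathrm{MSE}(X \mid \theta)]$:

\[
\begin{aligned}
\lim_{\substack{n \to \infty \\ P \to \infty}} \mathrm{MSE}(S^n \mid \theta) &= E_X[\mathrm{MSE}(X \mid \theta)] \\
&= \frac{1}{2} \sum_{i=1}^{C} \int_{\mathcal{X}} \bigl[ (g_i(X \mid \theta) - 1)^2 \cdot P_{W|X}(\omega_i \mid X) + g_i(X \mid \theta)^2 \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr] \rho_X(X)\, dX
\end{aligned}
\tag{2.70}
\]

\[
= \int_{\mathcal{X}} \underbrace{\frac{1}{2} \sum_{i=1}^{C} \bigl[ (g_i(X \mid \theta) - 1)^2 \cdot P_{W|X}(\omega_i \mid X) + g_i(X \mid \theta)^2 \cdot (1 - P_{W|X}(\omega_i \mid X)) \bigr]}_{\mathrm{MSE}(X \mid \theta)} \rho_X(X)\, dX \tag{2.71}
\]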

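To minimize $E_X[\mathrm{MSE}(X \mid \theta)]$ with respect to the parameters $\theta$ of the discriminator $\mathcal{G}(X \mid \theta)$, we solve for $\theta^*$ such that $\nabla_\theta(E_X[\mathrm{MSE}(X \mid \theta^*)]) = E_X[\nabla_\theta(\mathrm{MSE}(X \mid \theta^*))] = \mathbf{0}$. Following the litany of (2.39)–(2.42), we obtain the necessary condition for minimizing $E_X[\mathrm{MSE}(X \mid \theta)]$:

\[
\frac{\partial \varepsilon_i(X)}{\partial g_i(X \mid \theta^*)} = (g_i(X \mid \theta^*) - 1) \cdot P_{W|X}(\omega_i \mid X) + g_i(X \mid \theta^*) \cdot (1 - P_{W|X}(\omega_i \mid X)) = 0 \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.72}
\]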


In order for this equation to hold for any and all $\rho_X(X)$ on $\mathcal{X}$, it is necessary that

\[
(1 - g_i(X \mid \theta^*)) \cdot P_{W|X}(\omega_i \mid X) = g_i(X \mid \theta^*) \cdot (1 - P_{W|X}(\omega_i \mid X)) \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.73}
\]

or

\[
g_i(X \mid \theta^*) = P_{W|X}(\omega_i \mid X) \quad \forall i,\ \forall X \in \mathcal{X} \tag{2.74}
\]

Again, if the low and high-state target values for MSE training are not binary, the discriminator outputs will be a linear function of their corresponding a posteriori probabilities: $g_i(X \mid \theta) = P_{W|X}(\omega_i \mid X) \cdot (D - \neg D) + \neg D$.
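The non-binary case can be checked directly from the stationarity condition. The following sketch is illustrative only; the target values $D = 0.9$ and $\neg D = -0.9$ and the function names are assumptions. It evaluates the derivative of the pointwise expected squared error, $p \cdot \tfrac{1}{2}(D - g)^2 + (1 - p) \cdot \tfrac{1}{2}(g - \neg D)^2$, at the output predicted by the linear relationship above.

```python
def expected_mse_gradient(g, p, d_high, d_low):
    """d/dg of p * (D - g)^2 / 2 + (1 - p) * (g - notD)^2 / 2."""
    return -p * (d_high - g) + (1.0 - p) * (g - d_low)


def predicted_output(p, d_high, d_low):
    """Linear map from the a posteriori probability to the output."""
    return p * (d_high - d_low) + d_low


# Hypothetical posterior and non-binary target values.
p, d_high, d_low = 0.3, 0.9, -0.9
g_star = predicted_output(p, d_high, d_low)
print(g_star, expected_mse_gradient(g_star, p, d_high, d_low))
```

The gradient vanishes (to rounding error) at the predicted output, and it is negative below that point and positive above it, so the stationary point is the minimum.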


2.3.3  Minkowski-r  Power  Metrics  and  Other  Common  Error  Measures 

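The Minkowski-r power metric^19 is given by

\[
\varepsilon[y_i, T_i] = \frac{1}{r} \bigl( |y_i - T_i| \bigr)^{r} \tag{2.75}
\]

If we constrain discriminator output space as in (2.60),^20 the metric's functional form reduces to

\[
f(u) = \frac{1}{r} u^{r}, \qquad f'(u) = u^{r-1} \tag{2.76}
\]

Given $u$ in (2.42),

\[
\varepsilon[y_i, T_i] =
\begin{cases}
\frac{1}{r} (D - y_i)^{r}, & T_i = D \\
\frac{1}{r} (y_i - \neg D)^{r}, & T_i = \neg D
\end{cases}
\tag{2.77}
\]

Substituting (2.75) into (2.44), we find

\[
\frac{D - y_i}{y_i - \neg D} = \sqrt[r-1]{\frac{1 - P_{W|X}(\omega_i \mid X)}{P_{W|X}(\omega_i \mid X)}} \triangleq \zeta \tag{2.78}
\]

\[
\begin{aligned}
y_i &= \frac{D + \neg D \cdot \zeta}{1 + \zeta}, && 1 < r < \infty \\
&= (1 + \zeta)^{-1}; \quad \neg D = 0,\ D = 1, && 1 < r < \infty
\end{aligned}
\tag{2.79}
\]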


For binary target values, the discriminator outputs engendered by the limiting values of $r$ are

\[
\lim_{r \to 1^+} y_i =
\begin{cases}
1, & P_{W|X}(\omega_i \mid X) > .5 \\
.5, & P_{W|X}(\omega_i \mid X) = .5 \\
0, & P_{W|X}(\omega_i \mid X) < .5
\end{cases}
\qquad
\lim_{r \to \infty} y_i =
\begin{cases}
1, & P_{W|X}(\omega_i \mid X) = 1 \\
.5, & 0 < P_{W|X}(\omega_i \mid X) < 1 \\
0, & P_{W|X}(\omega_i \mid X) = 0
\end{cases}
\qquad \neg D = 0,\ D = 1 \tag{2.80}
\]

Figure 2.7: The discriminator output's minimum-error value for the Minkowski-r power metric (r = 1.25, 2, 9; binary output target values).

^19 The Minkowski-r power metric and the $L_p$ norm (e.g., [78, ch. 4]) are closely related, but not identical.

^20 It can be shown that if the constraints of (2.60) are violated, Minkowski-r power metrics can require complex discriminator outputs in order to satisfy (2.44).


Figure 2.7 illustrates the relationship between the discriminator output and its corresponding a posteriori class probability for three Minkowski-r power metrics with binary target values (r = 1.25, 2, 9). The r = 2 case corresponds to the MSE error measure described earlier.
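The curves that Figure 2.7 describes can be tabulated from the binary-target form of (2.79). This is a minimal sketch rather than the report's own code; the function name `minkowski_output` is hypothetical, and the formula is only evaluated for $0 < p < 1$ and $r > 1$.

```python
def minkowski_output(p, r):
    """Minimum-error output for binary targets: y = 1 / (1 + zeta), with
    zeta = ((1 - p) / p) ** (1 / (r - 1)); valid for 0 < p < 1 and r > 1."""
    zeta = ((1.0 - p) / p) ** (1.0 / (r - 1.0))
    return 1.0 / (1.0 + zeta)


for r in (1.25, 2.0, 9.0):
    row = [round(minkowski_output(p, r), 3) for p in (0.1, 0.3, 0.5, 0.7, 0.9)]
    print(r, row)
```

With r = 2 the output reproduces the a posteriori probability itself; values of r near 1 push the output towards 0 or 1, and large r flattens it towards 0.5, which matches the limiting behaviour stated in (2.80).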


(r = 1) Mean absolute error: Note that when $r = 1$ in (2.75)–(2.77), $\varepsilon[\cdot]$ is the mean absolute error (MAE) measure.^21 Because, by (2.80), this EM engenders binary discriminator outputs, it is not a wise choice for pattern recognition tasks in which (the number of classes) $C > 2$. In such cases it is possible that all the a posteriori class probabilities are less than 0.5 for some values of $X$. At such points on $\mathcal{X}$ the absolute error-generated classifier's discriminator outputs will all be zero by (2.80), and it will fail to identify the Bayes optimal class of $X$, a particularly undesirable characteristic. However, when $C = 2$, MAE has some desirable properties, which we discuss in section 5.3.1 (page 127).
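A small worked example makes the pathology concrete. The sketch below is illustrative; the three-class posteriors are hypothetical, and `mae_output` simply encodes the $r \to 1^+$ limit of (2.80) for binary targets.

```python
def mae_output(p):
    """Limiting (r -> 1) minimum-error output from (2.80), binary targets."""
    if p > 0.5:
        return 1.0
    if p < 0.5:
        return 0.0
    return 0.5


# Hypothetical three-class a posteriori probabilities at some point X.
posteriors = [0.40, 0.35, 0.25]
outputs = [mae_output(p) for p in posteriors]

# Index of the Bayes-optimal class (largest a posteriori probability).
bayes_class = max(range(len(posteriors)), key=posteriors.__getitem__)

# If every output is identical, the outputs cannot single out that class.
is_tie = len(set(outputs)) == 1
print(outputs, bayes_class, is_tie)
```

Every output collapses to zero even though the first class is the Bayes-optimal choice, so picking the largest output gives no information at this point of feature space.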

^21 The mean absolute error measure is known by many other names; the most common are least absolute error (LAE) and least absolute deviation (LAD, e.g., [9]).

The $r = \infty$ case above engenders discriminator outputs that are all constant at 0.5, resulting in a useless classifier. Values of $r$ on the open interval $(1, \infty)$ engender discriminator outputs $g_i(X \mid \theta^*)$ that are strictly increasing functions of their corresponding a posteriori class probabilities $P_{W|X}(\omega_i \mid X)$. Thus, classifiers generated with Minkowski-r power measures ($1 < r < \infty$) learn $\mathcal{F}(X)_{\text{Bayes-Probabilistic}}$ of definition 2.4; only the $r = 2$ form leads to $\mathcal{F}(X)_{\text{Bayes-Strictly Probabilistic}}$ of definition 2.3.

Other classes of error measures exist (e.g., [6, 98, 31, 43, 94]);^22 in general they are variants of well-known EMs, and can be analyzed via (2.44) to determine the discriminator outputs they engender for asymptotically large training sample sizes. It is intuitively appealing that error measures, with a few notable exceptions, lead to classifiers that learn probabilistically. On the other hand it leads us to question, solely on the basis of learning efficiency (chapter 3), the comparative advantage of choosing one error measure over another. To be sure, the statistical pattern recognition literature shows that specific error measures lead to more efficient learning, given particular class-conditional densities of $X$. To date, however, we know of no single best-choice EM for the feature vector with arbitrary class-conditional densities.

^22 See [91, sec. 1.12] for an extensive list of error measures.


2.4 Differential Learning

The differentiable supervised classifier learns differentially by adjusting its parameters to maximize a classification figure-of-merit over the training sample. We assume that conditions analogous to the favorable conditions preceding probabilistic learning exist prior to differential learning, in order to be sure that the classifier learns a differential form of the BDF $\mathcal{F}(X)_{\text{Bayes-Differential}} \in \mathbf{F}_{\text{Bayes-Differential}}$ (see definitions 2.5 and 2.6):

•  We assume that we have access to an unlimited number of training examples, so that we have sufficient data to learn some $\mathcal{G}(X \mid \theta) \in \mathbf{F}_{\text{Bayes-Differential}}$.

•  We assume that the discriminator $\mathcal{G}(X \mid \theta)$ has sufficient functional complexity to learn $\mathcal{F}(X)_{\text{Bayes-Differential}}$. Specifically, we assume that the classifier's parameter space $\Theta$ contains at least one point $\theta^*$ that both maximizes the CFM objective function (see section 2.2.4, chapter 5, and appendix D) over the training set and satisfies the constraint $\mathcal{G}(X \mid \theta^*) \in \mathbf{F}_{\text{Bayes-Differential}}$.

•  We assume that the algorithm we use to search for the parameters $\theta^*$ is guaranteed to find $\theta^*$, given sufficient time and computational resources.

The measure of CFM generated by a training sample $S^n$ is the sum of $\sigma[\delta, \psi]$ in definition 2.11 for each example:

\[
\mathrm{CFM}(S^n \mid \theta) = \frac{1}{n} \sum_{j=1}^{n} \bigl( \sigma[\delta_\tau(X^j \mid \theta), \psi] : W^j = \omega_\tau \bigr) \tag{2.81}
\]

\[
= \frac{1}{n} \sum_{p=1}^{P} \sum_{j=1}^{n_p} \bigl( \sigma[\delta_\tau(X_p^j \mid \theta), \psi] : W_p^j = \omega_\tau \bigr) \tag{2.82}
\]

where $n$ is the training sample size, and $P$, $n_p$, and $n_{p,i}$ (below) are defined in section 2.3.1. Equation (2.82) can be simplified by replacing the notion of examples with the more general notion of prototypes:


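\[
\mathrm{CFM}(S^n \mid \theta) = \frac{1}{n} \sum_{p=1}^{P} \sum_{i=1}^{C} \bigl[ n_{p,i} \cdot \sigma[\delta_i(X_p \mid \theta), \psi] \bigr] \tag{2.83}
\]

Thus

\[
\mathrm{CFM}(S^n \mid \theta) = \sum_{p=1}^{P} \frac{n_p}{n} \left[ \sum_{i=1}^{C} \frac{n_{p,i}}{n_p} \cdot \sigma[\delta_i(X_p \mid \theta), \psi] \right] \tag{2.84}
\]

As the training sample size $n$ grows asymptotically large, the empirical frequencies converge to their underlying probabilities.^23 Thus,

\[
\lim_{n \to \infty} \mathrm{CFM}(S^n \mid \theta) = \sum_{p=1}^{P} P_X(X_p) \left[ \sum_{i=1}^{C} P_{W|X}(\omega_i \mid X_p) \cdot \sigma[\delta_i(X_p \mid \theta), \psi] \right] \tag{2.85}
\]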


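Simultaneously, the number of patterns $P$ grows asymptotically large, such that $\mathrm{CFM}(S^n \mid \theta)$ converges in probability to the expected value of the CFM objective function, which we denote by $E_X[\mathrm{CFM}(X \mid \theta)]$:

\[
\lim_{\substack{n \to \infty \\ P \to \infty}} \mathrm{CFM}(S^n \mid \theta) = E_X[\mathrm{CFM}(X \mid \theta)] = \int_{\mathcal{X}} \underbrace{\left[ \sum_{i=1}^{C} \sigma[\delta_i(X \mid \theta), \psi] \cdot P_{W|X}(\omega_i \mid X) \right]}_{\mathrm{CFM}(X \mid \theta)} \rho_X(X)\, dX \tag{2.86}
\]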

In order to maximize $E_X[\mathrm{CFM}(X \mid \theta)]$ in (2.86) for any and all $\rho_X(X)$ on $\mathcal{X}$, we must maximize $\mathrm{CFM}(X \mid \theta)$ for all $X \in \mathcal{X}$. Since the $\delta$ terms in (2.86) are not independent of one another (see (2.87) and (2.90) below) $\mathrm{CFM}(X \mid \theta)$ cannot be maximized term-by-term; it must be maximized as a whole. Additionally, since $\frac{\partial}{\partial \delta}\sigma[\,\cdot\,]$ is non-negative, the common approach of differentiating $\mathrm{CFM}(X \mid \theta)$ with respect to some function of the $\delta$ terms is untenable. Instead we must take a less direct approach.

^23 See appendix B. Note, as in the preceding probabilistic learning proofs, we are not yet concerned with the specific rate at which this convergence takes place.

Suppose we rank the discriminator outputs, using the subscript $(j)$ to denote the $j$th-ranked output (not to be confused with the subscript $j$, which denotes the output associated with $\omega_j$):^24

\[
\begin{aligned}
y_{(1)} &= \max_{k} g_k(X \mid \theta) \\
y_{(2)} &= \max_{k \neq (1)} g_k(X \mid \theta) \\
y_{(j)} &= \max_{k \notin \{(1), (2), \ldots, (j-1)\}} g_k(X \mid \theta) \\
y_{(C)} &= \min_{k} g_k(X \mid \theta)
\end{aligned}
\tag{2.87}
\]

Then, by

\[
\delta_i = y_i - \max_{k \neq i} y_k \tag{2.88}
\]

or equivalently,

\[
\delta_i(X \mid \theta) \triangleq g_i(X \mid \theta) - \max_{k \neq i} g_k(X \mid \theta) \tag{2.89}
\]

each discriminant differential ($\delta_{(j)}$) can be expressed in terms of the largest one ($\delta_{(1)}$) minus a positive rank-dependent value $\epsilon_{(j)}$:

\[
\begin{aligned}
\delta_{(1)} &= y_{(1)} - y_{(2)} \\
\delta_{(j)} &= y_{(j)} - y_{(1)} = -\delta_{(1)} - \epsilon_{(j)}; \quad j \geq 2 \\
\epsilon_{(j)} &\triangleq y_{(2)} - y_{(j)} \quad \forall j \geq 2; \quad \epsilon_{(2)} = 0,\ \epsilon_{(j)} \geq 0
\end{aligned}
\tag{2.90}
\]
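The rank decomposition can be verified on a concrete output vector. The following Python sketch is an illustration only; the output values are hypothetical, and the helper `discriminant_differentials` is a direct transcription of the definition $\delta_i = y_i - \max_{k \neq i} y_k$.

```python
def discriminant_differentials(y):
    """delta_i = y_i - max_{k != i} y_k for each output (needs len(y) >= 2)."""
    return [yi - max(y[:i] + y[i + 1:]) for i, yi in enumerate(y)]


y = [0.2, 0.9, 0.5, 0.4]                      # hypothetical discriminator outputs
delta = discriminant_differentials(y)

ranked = sorted(y, reverse=True)              # y_(1) >= y_(2) >= ... >= y_(C)
delta_ranked = sorted(delta, reverse=True)    # delta_(1), delta_(2), ..., delta_(C)
delta_top = ranked[0] - ranked[1]             # delta_(1) = y_(1) - y_(2)
eps = [ranked[1] - yj for yj in ranked[1:]]   # eps_(2), ..., eps_(C)

# Rebuild delta_(j) = -delta_(1) - eps_(j) for every j >= 2.
reconstructed = [-delta_top - e for e in eps]
print(delta_ranked, reconstructed)
```

Only the top-ranked differential is positive; every other differential is the negated top differential pushed further down by its rank-dependent offset.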

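Using $\omega_{(i)}$ to denote the class associated with the $i$th-ranked output $y_{(i)}$ and discriminant differential $\delta_{(i)}$,

\[
\begin{aligned}
\mathrm{CFM}(X \mid \theta) &= \sum_{i=1}^{C} \sigma[\delta_{(i)}(X \mid \theta), \psi] \cdot P_{W|X}(\omega_{(i)} \mid X) \\
&= \sigma[\delta_{(1)}(X \mid \theta), \psi] \cdot P_{W|X}(\omega_{(1)} \mid X) + \sum_{j=2}^{C} \sigma[-\delta_{(1)}(X \mid \theta) - \epsilon_{(j)}(X \mid \theta), \psi] \cdot P_{W|X}(\omega_{(j)} \mid X)
\end{aligned}
\tag{2.91}
\]

^24 We adopt a notational convention from the field of rank statistics by using subscripts in parentheses to denote rank (e.g., [49]).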

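Since $\frac{\partial}{\partial \delta}\sigma[\,\cdot\,]$ and all $\epsilon_{(j)}(X \mid \theta)$ are non-negative, $\mathrm{CFM}(X \mid \theta)$ is greatest for a given $\delta_{(1)}(X \mid \theta)$ if $\epsilon_{(j)}(X \mid \theta) = 0\ \forall j \geq 2$ (i.e., if all but the top-ranked output have the same value). In this case

\[
\begin{aligned}
\mathrm{CFM}(X \mid \theta) &= \sigma[\delta_{(1)}(X \mid \theta), \psi] \cdot P_{W|X}(\omega_{(1)} \mid X) + \sigma[-\delta_{(1)}(X \mid \theta), \psi] \cdot (1 - P_{W|X}(\omega_{(1)} \mid X)) \\
&= \sigma[0, \psi] + \underbrace{\vartheta(X) \cdot P_{W|X}(\omega_{(1)} \mid X) + \neg\vartheta(X) \cdot (1 - P_{W|X}(\omega_{(1)} \mid X))}_{\Delta\mathrm{CFM}(X \mid \theta)}
\end{aligned}
\tag{2.92}
\]

where the perturbations ($\vartheta(X)$ and $\neg\vartheta(X)$) in the value of CFM from its equilibrium value $\sigma[0, \psi]$, due to a non-zero discriminant differential $\delta_{(1)}(X \mid \theta)$, are

\[
\begin{aligned}
\vartheta(X) &\triangleq \sigma[\delta_{(1)}(X \mid \theta), \psi] - \sigma[0, \psi] \geq 0 \\
\neg\vartheta(X) &\triangleq \sigma[-\delta_{(1)}(X \mid \theta), \psi] - \sigma[0, \psi] \leq 0
\end{aligned}
\tag{2.93}
\]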


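Since $\vartheta(X)$ is non-negative and $\neg\vartheta(X)$ is non-positive, $\Delta\mathrm{CFM}(X \mid \theta)$ is greatest, given specific values of $\vartheta(X)$ and $\neg\vartheta(X)$, when $\omega_{(1)} = \omega_*$ (again, by (2.8), $\omega_*$ denotes the class with the largest a posteriori probability). In this case, the discriminator output corresponding to the largest a posteriori class probability is larger than all the other outputs. We denote this Bayes-optimal output by $y_*$, and denote the corresponding output differential by $\delta_*$.

The specific values of $\vartheta(X)$ and $\neg\vartheta(X)$ that maximize $\mathrm{CFM}(X \mid \theta)$ in (2.92), values that we denote by $\vartheta(X)^*$ and $\neg\vartheta(X)^*$, depend on the specific functional form of $\sigma[\delta_*(X \mid \theta^*), \psi]$. Clearly they must satisfy

\[
\begin{gathered}
\vartheta(X)^* = \sigma[\delta_*(X \mid \theta^*), \psi] - \sigma[0, \psi], \qquad
\neg\vartheta(X)^* = \sigma[-\delta_*(X \mid \theta^*), \psi] - \sigma[0, \psi], \\
\underbrace{\left( \frac{\vartheta(X)^*}{-\neg\vartheta(X)^*} > \frac{1 - P_{W|X}(\omega_* \mid X)}{P_{W|X}(\omega_* \mid X)} \right)}_{\Delta\mathrm{CFM}(X \mid \theta^*) > 0}
\;\cup\;
\underbrace{\bigl( \vartheta(X)^* = \neg\vartheta(X)^* = 0 \bigr)}_{\Delta\mathrm{CFM}(X \mid \theta^*) = 0}, \\
\omega_{(1)} = \omega_* \quad \text{s.t.} \quad \delta_{(1)}(X \mid \theta) = \delta_*(X \mid \theta^*)
\end{gathered}
\tag{2.94}
\]

for $\Delta\mathrm{CFM}(X \mid \theta)$ in (2.92) to be non-negative. If no $\delta_{(1)}(X \mid \theta) = \delta_*(X \mid \theta^*) > 0$ yields $\vartheta(X)^*$ and $\neg\vartheta(X)^*$ that satisfy the condition for $\Delta\mathrm{CFM}(X \mid \theta^*) > 0$ in (2.94), then $\mathrm{CFM}(X \mid \theta)$ is maximized when $\delta_{(i)}(X \mid \theta) = 0\ \forall i$ s.t. $\Delta\mathrm{CFM}(X \mid \theta) = 0$. That is, all the discriminator outputs have the same value, and the optimal $\mathrm{CFM}(X \mid \theta^*)$ is the "equilibrium" $\mathrm{CFM}(X \mid \theta^*) = \sigma[0, \psi]$.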

Theorem 2.2 The CFM objective function $\sigma[\delta, \psi]$ must be a bounded sigmoidal function spanning a continuum between a linear and a step function of $\delta$ in order to ensure that differential learning always generates the Bayes-optimal classifier.

Proof: Consider two extreme scenarios:

•  In the first scenario, $X$ represents $C = 2$ classes $\{\omega_1, \omega_2\}$. Thus, the smallest a posteriori class probability that the more likely class $\omega_*$ can have is $P_{W|X}(\omega_* \mid X) = \frac{1}{2}$. Under these circumstances, (2.94) is satisfied if, by (2.90) and (2.93), $\delta_*(X \mid \theta^*) > 0$ and $\sigma[\delta, \psi] \propto \delta$. In simple words, if the CFM objective function is linear in the discriminant differential $\delta$, it will generate the Bayes-optimal classifier for the two-class case.

•  In the second (worst-case) scenario, $X$ represents $C = \infty$ classes $\{\omega_1, \ldots, \omega_\infty\}$. Thus, the smallest a posteriori class probability that the most likely class $\omega_*$ can have approaches zero: $P_{W|X}(\omega_* \mid X) \to 0^+$. Under these circumstances, (2.94) is satisfied if and only if, by (2.90) and (2.93), $\delta_*(X \mid \theta^*) > 0$ and

\[
\sigma[\delta, \psi] =
\begin{cases}
h, & \delta > 0 \\
l, & \delta \leq 0
\end{cases}
\]

where $l$ and $h$ are real constants, and $l < h$. In simple words, the CFM objective function must be a step function of the discriminant differential $\delta$ in order to generate the Bayes-optimal classifier for the malicious $C \gg 1$-class case.

For the more general case that falls between these two extremes, CFM must have a bounded sigmoidal shape in order for (2.94) to hold via (2.90)–(2.93). The lack of a finite lower bound $l$ on $\sigma[\delta, \psi]$ in particular prevents the ratio $\frac{\vartheta(X)^*}{-\neg\vartheta(X)^*}$ from being sufficiently large to satisfy (2.94) for all $P_{W|X}(\omega_* \mid X)$. The lack of a finite upper bound $h$ on $\sigma[\delta, \psi]$ generates classifiers with large discriminant differentials. While this phenomenon is not fatal to Bayesian discrimination (as the lack of a finite lower bound is), it does prevent the discriminator from learning those forms of $\mathcal{F}(X)_{\text{Bayes-Differential}}$ for which $\delta_*(X \mid \theta^*)$ is a small positive number. As we will see in chapters 3 and 6, if differential learning via CFM is to be efficient, it must allow the classifier to learn any and all $\mathcal{G}(X \mid \theta) \in \mathbf{F}_{\text{Bayes-Differential}}$. Thus, CFM must have a bounded sigmoidal form. ∎

Remark: Section III of [55] and section 5.4 provide a more intuitive rationale for the sigmoidal form of CFM, which might be helpful to the reader. We stress that the steepness of the CFM sigmoid need not vary across training examples; it simply needs to satisfy (2.94) for the worst-case $P_{W|X}(\omega_* \mid X)$ approximated in the statistics of the training sample. Chapter 7 discusses practical approaches to setting $\psi$ in order to ensure that (2.94) is satisfied.

We derive the specific values of $\vartheta(X)^*$ and $\neg\vartheta(X)^*$ for the two limiting forms of the CFM objective function satisfying the constraints of definition 2.11 and appendix D. The derivations assume that $l = 0$, $h = 1$, and $-1 \leq \delta \leq 1$ in (2.23) for the sake of simplicity. This constraint on $\delta$ requires that the discriminator outputs be bounded: $Y \in \mathcal{Y} = [0, 1]^C$. Since the output state of any classifier can be normalized to $[0, 1]^C$ via a simple affine transformation, the following derivations hold for the general classifier with outputs $Y \in \mathcal{Y}$.


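Linear CFM ($\psi = 1$): When the confidence parameter $\psi$ assumes its maximum value of unity, the CFM objective function has the following form; the expression is approximately linear for all discriminant differentials $\delta$ not greater than one, otherwise it assumes the maximum value of unity:

\[
\sigma[\delta, \psi = 1] \cong
\begin{cases}
\frac{1}{2}(1 + \delta), & \delta \leq 1 \\
1, & \text{otherwise}
\end{cases}
\tag{2.95}
\]

The perturbations ($\vartheta(X)$ and $\neg\vartheta(X)$) in the value of CFM from its equilibrium value $\sigma[0, \psi]$, due to a non-zero discriminant differential $\delta_*(X \mid \theta^*)$, are therefore

\[
\begin{aligned}
\sigma[0, \psi = 1] &\cong \tfrac{1}{2} \\
\vartheta(X) &\cong
\begin{cases}
\tfrac{1}{2}\,\delta_*(X \mid \theta^*), & \delta_*(X \mid \theta^*) \leq 1 \\
\tfrac{1}{2}, & \text{otherwise}
\end{cases} \\
\neg\vartheta(X) &\cong
\begin{cases}
-\tfrac{1}{2}\,\delta_*(X \mid \theta^*), & -\delta_*(X \mid \theta^*) \geq -1 \\
-\tfrac{1}{2}, & \text{otherwise}
\end{cases}
\end{aligned}
\tag{2.96}
\]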


and (2.94) is satisfied if and only if $P_{W|X}(\omega_* \mid X) > \frac{1}{2}$. Thus, for all non-boundary $X$, the discriminant differential that maximizes $\mathrm{CFM}(X \mid \theta)$ is

\[
\delta_*(X \mid \theta^*) =
\begin{cases}
1 \ \text{ s.t. } \vartheta(X)^* \cong -\neg\vartheta(X)^* \cong \tfrac{1}{2}, & P_{W|X}(\omega_* \mid X) > \tfrac{1}{2} \\
0 \ \text{ s.t. } \vartheta(X)^* = \neg\vartheta(X)^* = 0, & \text{otherwise}
\end{cases}
\tag{2.97}
\]

\[
\ldots =
\begin{cases}
\ldots \\
0, & \text{otherwise}
\end{cases}
\tag{2.98}
\]

Differential learning via linear CFM therefore exhibits the same pathology that probabilistic learning via mean absolute error (MAE) does. If $C = 2$, then a classifier that learns via linear CFM can learn $\mathcal{F}(X)_{\text{Bayes-Differential}}$. However, if $C > 2$, then all discriminator outputs of the linear CFM-generated classifier will have the same value for all regions on $\mathcal{X}$ where $P_{W|X}(\omega_* \mid X) < \frac{1}{2}$. On such regions the linear CFM-generated classifier will fail to identify the Bayes-optimal class of $X$.
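The linear-CFM pathology can be reproduced with a short search over the top-ranked differential. This sketch is not from the report: it assumes the linear form $\sigma[\delta, \psi = 1] = \frac{1}{2}(1 + \delta)$ on $-1 \leq \delta \leq 1$, that all lower-ranked outputs are tied, and that the top-ranked class is the Bayes-optimal one. The function names are hypothetical.

```python
def linear_cfm(delta):
    """Assumed linear CFM on -1 <= delta <= 1: sigma = (1 + delta) / 2."""
    return 0.5 * (1.0 + delta)


def cfm_at_x(delta_top, p_top):
    """Pointwise CFM when every lower-ranked output shares one value."""
    return linear_cfm(delta_top) * p_top + linear_cfm(-delta_top) * (1.0 - p_top)


def best_differential(p_top, steps=100):
    """Search delta_(1) in [0, 1] for the value that maximizes pointwise CFM."""
    grid = [k / steps for k in range(steps + 1)]
    return max(grid, key=lambda d: cfm_at_x(d, p_top))


print(best_differential(0.7), best_differential(0.4))
```

Under these assumptions the pointwise CFM equals $\frac{1}{2} + \frac{1}{2}\delta(2p - 1)$, so the best differential jumps to its maximum when the top posterior exceeds one half and collapses to zero (all outputs tied) when it does not.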


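Step CFM ($\psi = 0^+$): When

\[
\sigma[\delta, \psi = 0^+] =
\begin{cases}
1, & \delta > 0 \\
0, & \delta \leq 0
\end{cases}
\]

\[
\begin{aligned}
\sigma[0, \psi = 0^+] &= 0 \\
\vartheta(X) &=
\begin{cases}
1, & \delta_*(X \mid \theta^*) > 0 \\
0, & \delta_*(X \mid \theta^*) \leq 0
\end{cases} \\
\neg\vartheta(X) &= 0
\end{aligned}
\tag{2.99}
\]

and (2.94) is satisfied for all $P_{W|X}(\omega_* \mid X)$. Thus, for all non-boundary $X$, the discriminant differential $\delta_*$ need only be positive to maximize $\mathrm{CFM}(X \mid \theta)$,

\[
\delta_*(X \mid \theta^*) > 0 \quad \text{s.t.} \quad \vartheta(X)^* = 1,\ \neg\vartheta(X)^* = 0, \tag{2.100}
\]

the discriminator outputs all satisfy the constraints of (2.9), and the classifier that learns with the step form of CFM learns $\mathcal{F}(X)_{\text{Bayes-Differential}}$.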

The general CFM ($0^+ < \psi < 1$): When CFM is neither of its limiting forms, differentiating $\sigma[\delta, \psi]$ with respect to $\delta$ does not lead to a closed-form expression for the specific value of $\delta_*(X \mid \theta^*)$ corresponding to $\vartheta(X)^*$ and $\neg\vartheta(X)^*$ in (2.94).

Nevertheless, the steepness of the CFM objective function's sigmoidal form, regulated by $\psi$, can be shown to govern $\vartheta(X)$ and $\neg\vartheta(X)$ thus:^25

\[
\ldots \tag{2.101}
\]

Indeed, as long as^26

\[
0 < \psi < 1 \;\cap\; \psi < 1.43\, \frac{P_{W|X}(\omega_* \mid X)}{1 - P_{W|X}(\omega_* \mid X)} \tag{2.102}
\]

both (2.9) and (2.94) are satisfied, and maximizing CFM ensures that the classifier learns a differential form of the Bayesian discriminant function:

\[
\begin{gathered}
\operatorname{sign}[\delta_i(X \mid \theta)] = \operatorname{sign}[\Delta_{W|X}(\omega_i \mid X)] \quad \forall i,\ \forall X \in \mathcal{X} \\
\therefore\quad \mathcal{G}(X \mid \theta) = \mathcal{F}(X)_{\text{Bayes-Differential}} \in \mathbf{F}_{\text{Bayes-Differential}}
\end{gathered}
\tag{2.103}
\]

Maximizing CFM is tantamount to establishing a correlation coefficient of unity between the index of the discriminator's largest output $g_{(1)}(X \mid \theta)$ and the Bayes-optimal class label $\omega_*$, given by (2.8). In short, differential learning is discriminative. Chapter 3 proves that for the limiting step form of the CFM objective function (i.e., $\lim_{\psi \to 0^+} \sigma[\delta, \psi]$), maximizing CFM is equivalent to minimizing the classifier's error rate. That proof is central to the proof that differential learning via CFM is asymptotically efficient.
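The link between the step form of CFM and the error rate can be illustrated (not proven) on a toy sample. The sketch below assumes the step function takes the values 0 and 1, that the empirical CFM is averaged over the sample, and that no example has tied outputs; the sample itself is hypothetical.

```python
def step_cfm(delta):
    """Limiting step form: 1 for a positive differential, 0 otherwise."""
    return 1.0 if delta > 0 else 0.0


def differential(y, i):
    """delta_i = y_i - max_{k != i} y_k."""
    return y[i] - max(y[:i] + y[i + 1:])


# Hypothetical (discriminator outputs, true class index) pairs.
sample = [([0.8, 0.1, 0.1], 0), ([0.3, 0.6, 0.1], 0),
          ([0.2, 0.2, 0.6], 2), ([0.5, 0.4, 0.1], 1)]

# Empirical CFM: average step-CFM of the differential of the labelled class.
cfm = sum(step_cfm(differential(y, w)) for y, w in sample) / len(sample)

# Empirical accuracy: fraction of examples whose largest output is the label.
accuracy = sum(max(range(len(y)), key=y.__getitem__) == w
               for y, w in sample) / len(sample)
print(cfm, accuracy)
```

The differential of the labelled class is positive exactly when that class has the largest output, so under these assumptions the empirical step CFM and the empirical classification accuracy are the same number.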

The preceding proofs make no assumptions regarding the functional form of $\mathcal{G}(X \mid \theta)$ beyond those stated at the beginning of this section, nor do they restrict the number of classes $C$. Barnard proffers a less general but more elegant proof that differential learning leads to Bayesian discrimination in [5]; it is restricted to the two-class case in which the mappings $g_i(X \mid \theta)$ are linear functions of $X$.


2.4.1 Further Constraints Imposed on $\psi$ by the Discriminator


Equation (2.102) specifies an upper bound on $\psi$, given the a posteriori probability of the most likely class ($P_{W|X}(\omega_* \mid X)$), which of course depends on $X$. Since $\psi$ uniquely specifies the minimum discriminant differential $\delta_*(X \mid \theta^*)$ for which CFM is maximized (this value is given in appendix D; see figure D.1, page 329), (2.102) implicitly assumes that the discriminator $\mathcal{G}(X \mid \theta)$ can generate a discriminant differential at least this large, in order to satisfy the constraint: $\psi$ must satisfy (2.102) and possibly be reduced further, such that

\[
\delta_*(X \mid \theta^*) > \ldots \;\cap\; \sigma[\delta_*(X \mid \theta^*), \psi] = 1 \tag{2.104}
\]

(D.26)


In the event that the discriminator cannot satisfy (2.104) for the $\psi$ specified by (2.102), $\psi$ must be reduced
to the value at which (2.104) is satisfied. In simple terms, the confidence parameter must be reduced so
that the discriminator maximizes CFM for the largest positive discriminant differential it can
manage to generate, however small that differential might be. Under such conditions, the upper bound on $\psi$ is
determined by the functional properties of the discriminator rather than by the a posteriori class probabilities
of the feature vector. An illustration of this phenomenon is given in section 5.3.6.

2.5  Summary 

In this chapter we have outlined the general statistical pattern recognition paradigm and defined two
fundamental forms of the Bayesian discriminant function: probabilistic and differential. We have shown that
each of these BDF forms is associated with a learning strategy, and that each learning strategy is, in turn,
associated with a family of objective functions used to train the differentiable supervised classifier. We have
proven that both the probabilistic and differential learning strategies generate the Bayes-optimal classifier,
given sufficient information, computational, and temporal resources.

The  probabilistic  and  differential  learning  strategies  lead  to  Bayesian  discrimination  in  substantially 
different  ways.  The  probabilistic  strategy  has  the  distinct  advantage  of  generating  classifiers  that  reflect  the  a 
posteriori  class  probabilities  of  the  feature  vector  in  the  outputs  of  the  discriminator,  whereas  the  differential 
strategy  merely  reflects  the  identity  of  the  most  probable  class.  There  are  clear  advantages  to  having  the 
classifier  estimate  the  a  posteriori  class  probabilities  of  the  feature  vector.  This  fact  might  lead  the  reader 
to  wonder  what  advantage  there  is  in  using  differential  learning  instead  of  probabilistic  learning.  In  fact, 
the  advantage  lies  in  the  efficiency  of  the  differential  learning  strategy  —  that  is,  its  ability  to  approximate 
Bayesian  discrimination  with  the  smallest  training  sample  and  the  least  complex  classifier  necessary  for  the 
task. 

Throughout  this  chapter  we  have  assumed  that  we  have  access  to  unlimited  training  data,  a  classifier  with 
potentially  unlimited  functional  complexity,  a  search  algorithm  that  assures  us  of  finding  the  globally  optimal 
parameterization  of  our  classifier,  and  infinite  time  for  the  algorithm  to  converge.  In  reality  none  of  these 
advantages  exist;  we  face  the  challenge  of  achieving  the  best  discrimination  possible,  given  limited  training 
data,  relatively  simple  classifier  paradigms,  limited  time  for  learning,  and  search  algorithms  that  become 
increasingly  slow  and  subject  to  halting  in  local  optima  as  classifier  complexity  increases.  If  the  ultimate 
objective  is  estimating  the  a  posteriori  class  probabilities  of  the  feature  vector,  then  we  are  compelled  to 
use  probabilistic  learning.  This,  in  turn,  compels  us  to  employ  a  sufficiently  complex  classifier  and  obtain 
a  sufficiently  large  training  sample  if  we  are  to  have  confidence  in  the  classifier's  probabilistic  estimates. 
However,  if  the  ultimate  objective  is  simply  to  classify  patterns,  then  differential  learning  is  a  better  strategic 
choice,  allowing  us  to  achieve  the  goal  of  robust  pattern  classification  efficiently.  Proofs  of  this  claim  are 
given  in  chapters  3  and  6. 


Chapter  3 


Differential Learning is Asymptotically
Efficient^1


Outline 

We present a formal definition of efficiency in the statistical pattern recognition context. By viewing
the classifier's discriminator as an estimator of the Bayesian discriminant function (BDF), we regard the
classifier's error rate^2 as an estimator of the Bayes error rate (i.e., the Bayes-optimal classifier's minimum
error rate). On this basis, we use traditional estimation-theoretic notions of bias and variance to define the
efficient classifier and the efficient learning strategy. These definitions, in turn, lead to a quantitative measure
of generalization: the ability of a classifier to discriminate accurately examples not encountered during
learning. We prove that differential learning is asymptotically efficient and that it requires the classifier with
the least functional complexity necessary for Bayesian discrimination. We prove that probabilistic learning
is inefficient and that it does not guarantee Bayesian discrimination with the minimum-complexity classifier.

3.1  Introduction 

The proofs of chapter 2 rely on favorable but unrealistic assumptions of learnability. Regardless of whether
the  Bayes-optimal  classifier  is  simple  or  complex,  we  have  only  limited  time  and  computational  resources, 
and  —  perhaps  most  importantly  —  limited  access  to  training  data.  In  the  face  of  such  restrictions  the 
efficiencies  of  the  learning  strategy  and  the  classifier  it  generates  become  important. 

In this chapter we employ classical estimation theory (e.g., [22, chs. 32-34] [134]) to define the
mean-squared discriminant error of a classifier as the expected squared difference between its error rate
and the Bayes error rate (i.e., the minimum possible error rate, yielded by the Bayes-optimal classifier of
definition 2.1). The classifier's mean-squared discriminant error is a measure of its ability to generalize
well to patterns not encountered during learning. The efficient classifier exhibits minimum mean-squared
discriminant error, thereby guaranteeing the highest probability of good generalization. The efficient learning
strategy is defined as one that guarantees the lowest mean-squared discriminant error allowed by the a
priori choice of “hypothesis class”,^3 no matter what that choice is. These definitions are shown to be quite
different from the definitions of functional bias, variance, and mean-squared error typically discussed in the
connectionist, machine learning, and statistical pattern recognition literature.

^1 Sections 3.3 and 3.5 contain detailed versions of proofs first outlined in [53].
^2 We use the term “error rate” as a synonym for probability of error.

We show that the differentiable supervised classifier performs general Bayesian learning (e.g., [29, sec.
3.5]) in which its discriminator's parameterization is transformed to a post-learning state from a “tabula
rasa” state prior to learning. Viewed over all possible initial parameterizations, learning transforms the
classifier's^4 a priori parameter probability density function (pdf) to its posterior parameter pdf. As a result,
we show that the classifier's expected ability to approximate Bayesian discrimination depends entirely on the
posterior parameter pdf, which in turn and in part depends on the learning strategy employed.

We  prove  that  differential  learning  is  asymptotically  efficient  (i.e.,  efficient  for  asymptotically  large 
training  sample  sizes)  and  that  it  requires  the  classifier  with  the  least  functional  complexity  necessary 
for  Bayesian  discrimination;  we  also  prove  that  probabilistic  learning  is  inefficient  and  that  it  does  not 
guarantee  Bayesian  discrimination  with  the  minimum-complexity  classifier.  We  therefore  argue  in  favor  of 
differential learning and against probabilistic learning (for all but a few special cases) when information and
computational  resources  are  limited. 


3.2 Discriminant Error, the Efficient Classifier, and the Efficient Learning Strategy


Consider the classifier's discriminator as an estimator of the Bayesian discriminant function or, more
precisely, consider the classifier's error rate as an estimator of the Bayes error rate (definition 3.2).^5 From
this perspective, the classifier's discriminant efficiency can be assessed in terms of discriminant bias and
variance expressions that reflect how well and how consistently the classifier approximates the Bayes error
rate for the pattern recognition task. We use the notation $P_e(\cdot)$ to denote error rate, and remind the reader that
r(^(X  1 0))  and  P  (X 1 0) ,  which  are  given  in  (2.6)  and  (2.7),  represent  the  class  label  that  the  classifier 
with  parameterization  0  assigns  to  X.  Thus,  the  probability  that  the  classifier  will  misclassify  X  is 


^3 The term hypothesis class arises in PAC learning theory. In the statistical pattern recognition context it
describes the set of all possible discriminators $\mathcal{G}(X \mid \theta)$, given our choice of classifier paradigm and
parameter space. The classifier paradigm is determined by the functional basis of its discriminant functions
(e.g., the logistic functional basis of multi-layer perceptrons and the Gaussian basis of Gaussian radial basis
functions). The set of all C-output multi-layer perceptrons with no more than 500 total connections is therefore
an example of a hypothesis class. Please see definition 3.5.

^4 The classifier's and discriminator's parameterizations are one and the same.

^5 Fukunaga takes this perspective in [40, ch. 7], although he considers the estimated error rate of the classifier,
rather than the true error rate to which we refer (see definition 3.1). We use the term “error rate” when referring
to the classifier's true probability of error, and “estimated error rate” when referring to any empirical estimate
of the classifier's true probability of error. Of course, the classifier's error rate is an abstraction: a number
that we cannot know with certainty for any real classifier. Nevertheless, the quantity is central to the
arguments of this chapter.

P,(a(X|0))  =  1  -  P,v|x(^^(X|0)  |X)

(3.1) 

=  1  -  Pwix(r(e(x|0))|X) 

Definition 3.1 The Classifier's Error Rate (or Probability of Error): The classifier's error rate, or
probability of error, is the expected value of its probability of misclassifying X. We denote this expectation
for the classifier with parameterization $\theta$ by $P_e(\mathcal{G} \mid \theta)$, where

$$ P_e(\mathcal{G} \mid \theta) = E_x[P_e(\mathcal{G}(X \mid \theta))] = \int P_e(\mathcal{G}(X \mid \theta)) \, \rho_x(X) \, dX \tag{3.2} $$


Remark: Note that this error rate is the true error rate of the classifier, not an estimate. As such, it represents
a theoretical number; knowing the number requires knowledge of the feature vector's class-conditional pdfs
and its class prior probabilities. Since we do not know these (if we did, we could deterministically create the
Bayes-optimal classifier), we cannot know $P_e(\mathcal{G} \mid \theta)$. However, this error rate is essential to our theoretical
arguments, so we ask the reader to imagine that we pass our classifier with the discriminator $\mathcal{G}(X \mid \theta)$ to an
oracle. The oracle knows the probabilistic nature of the feature vector X and can deterministically compute
for us the value of $P_e(\mathcal{G} \mid \theta)$, given any and all $\mathcal{G}(X \mid \theta)$.

Definition 3.2 The Bayes Error Rate: Recall from section 2.2.2 that $\mathcal{F}(X)_{Bayes}$ denotes the Bayesian
discriminant function of X in any of its possible forms. The Bayes error rate, which we denote by
$P_e(\mathcal{F}_{Bayes})$, is the error rate of the Bayes-optimal classifier (see definition 2.1). By definition, this is the
lowest error rate possible for a classifier of X:

$$ P_e(\mathcal{F}_{Bayes}) = E_x[P_e(\mathcal{F}(X)_{Bayes})] = \int P_e(\mathcal{F}(X)_{Bayes}) \, \rho_x(X) \, dX \tag{3.3} $$

where

$$ P_e(\mathcal{F}(X)_{Bayes}) = 1 - P_{W|x}(\Gamma(\mathcal{F}(X)_{Bayes}) \mid X) \tag{3.4} $$
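The oracle of definitions 3.1 and 3.2 can be imitated numerically when the class-conditional pdfs are known by construction. The following is a minimal sketch, not the authors' procedure: it assumes a hypothetical two-class scalar task with unit-variance Gaussian classes and equal priors, and evaluates the integrals of (3.2) and (3.3) with a midpoint rule. All names (`posteriors`, `error_rate`, `threshold_classifier`) are illustrative.

```python
import math

# Hypothetical task (an assumption for illustration only):
# class 0 ~ N(-1, 1), class 1 ~ N(+1, 1), equal class priors.
PRIORS = (0.5, 0.5)
MEANS = (-1.0, 1.0)
SIGMA = 1.0

def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(x):
    """Return the a posteriori class probabilities of x and the pdf of x."""
    joint = [p * normal_pdf(x, m, SIGMA) for p, m in zip(PRIORS, MEANS)]
    total = sum(joint)
    return [j / total for j in joint], total

def error_rate(classify, lo=-10.0, hi=10.0, steps=20000):
    """Integrate [1 - P(assigned class | x)] * pdf(x) over x (midpoint rule)."""
    dx = (hi - lo) / steps
    acc = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * dx
        post, px = posteriors(x)
        acc += (1.0 - post[classify(x)]) * px * dx
    return acc

def bayes_classify(x):
    """Assign the class with the largest a posteriori probability."""
    post, _ = posteriors(x)
    return max(range(len(post)), key=lambda i: post[i])

def threshold_classifier(t):
    """A simple non-optimal classifier: class 0 below t, class 1 otherwise."""
    return lambda x: 0 if x < t else 1

bayes_error = error_rate(bayes_classify)
shifted_error = error_rate(threshold_classifier(0.5))
```

Under these assumptions the Bayes-optimal decision boundary lies at x = 0, so any other threshold yields a strictly larger error rate; the difference between the two numbers is the discriminant error introduced in section 3.2.2.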


3.2.1  Learning  and  Expectation 

A differentiable supervised classifier learns by transforming its initial parameterization (which we denote by
$\theta_0$) into its post-learning (or posterior) parameterization $\theta$. As described in section 2.2.3, this transformation
involves adjusting the differentiable supervised classifier's parameters via an iterative search aimed at
optimizing an objective function; the learning strategy describes this process.

Definition 3.3 The Learning Strategy A: Ultimately, the learning strategy A reduces to a description
of the mapping from the classifier's initial parameterization $\theta_0$ to its posterior parameterization $\theta$, given
the training sample $S^n$ (see definition 3.4) and the hypothesis class $G(\Theta)$ (see definition 3.5):

$$ A : \theta_0 \rightarrow \theta, \qquad \theta = A(\theta_0 \mid S^n, G(\Theta)) \tag{3.5} $$

The classifier's initial parameterization $\theta_0$ is generated according to the prior pdf $\rho_{\theta_0}(\theta_0)$ on parameter
space $\Theta$. The posterior parameterization $\theta$ is stochastic because both $\theta_0$ and $S^n$ are; thus, A can also
be viewed as an “algorithm” for general Bayesian learning (e.g., [29, sec. 3.5]).
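The mapping of (3.5) can be sketched as an ordinary function that takes an initial parameterization and a training sample and returns the posterior parameterization. The example below is an illustration under stated assumptions, not the CFM-based strategy of this report: the hypothesis class is a hypothetical one-input logistic discriminator with parameters (w, b), the objective is mean cross-entropy (a probabilistic-learning objective used here only as a stand-in), and the search is plain gradient descent.

```python
import math
import random

def draw_training_sample(rng, n):
    """S^n: n (example, label) pairs from a hypothetical two-class source."""
    sample = []
    for _ in range(n):
        label = rng.randrange(2)
        mean = -1.0 if label == 0 else 1.0
        sample.append((rng.gauss(mean, 1.0), label))
    return sample

def learning_strategy(theta0, sample, steps=200, rate=0.5):
    """Map theta0 to theta by gradient descent on mean cross-entropy."""
    w, b = theta0
    n = len(sample)
    for _ in range(steps):
        grad_w = 0.0
        grad_b = 0.0
        for x, label in sample:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - label) * x / n
            grad_b += (p - label) / n
        w -= rate * grad_w
        b -= rate * grad_b
    return w, b

rng = random.Random(0)
# theta0 is drawn from an assumed uniform prior on a small box.
theta0 = (rng.uniform(-0.1, 0.1), rng.uniform(-0.1, 0.1))
theta = learning_strategy(theta0, draw_training_sample(rng, 50))
```

Re-running with a different seed changes both $\theta_0$ and $S^n$, and therefore $\theta$, which is the sense in which the posterior parameterization is stochastic.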

Definition 3.4 The Training Sample $S^n$: The training sample $S^n$ is the set of n example/class label
pairs $\{(X^1, W^1), \ldots, (X^n, W^n)\}$, generated according to their (unknown) joint pdf.
Note that if the training example pairs are independent and identically distributed (iid) the joint pdf of the
training sample can be expressed as $\prod_{i=1}^{n} \rho_{x,W}(X^i, W^i)$.

Definition 3.5 The Hypothesis Class $G(\Theta)$: In the statistical pattern recognition context, the
hypothesis class $G(\Theta)$ is the set of all possible discriminators $\mathcal{G}(X \mid \theta)$:

$$ G(\Theta) = \{ \mathcal{G}(X \mid \theta) : \theta \in \Theta \} = \{ \{ g_1(X \mid \theta), \ldots, g_C(X \mid \theta) \} : \theta \in \Theta \}, \tag{3.6} $$

where $\mathcal{G}(X \mid \theta)$ and $g_i(X \mid \theta)$ satisfy the conditions of (2.4) and (2.5) and represent functions in a particular
basis or combination of bases G (e.g., polynomial basis, Gaussian radial basis, logistic functional basis,
etc.).  We  denote  die  set  of  all  hypothesis  classes  by  such  that  G{0)  C 

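Example 3.1 Consider the C = 3-class pattern recognition task involving the scalar feature x. The
classifier with the discriminator $\mathcal{G}(x \mid \theta) = \{ g_1(x \mid \theta), g_2(x \mid \theta), g_3(x \mid \theta) \}$ is used. Each discriminant
function $g_i(x \mid \theta) \in \mathcal{G}(x \mid \theta)$ is a 10th-order real polynomial function of x:

$$ g_i(x \mid \theta) = \sum_{k=0}^{10} \theta_{i,k} \, x^k; \qquad i = 1, 2, 3 \tag{3.7} $$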



Thus, the classifier has a total of 33 parameters (a different set of 11 for each of the 3 discriminant functions).^6
We denote all of these parameters by $\theta$, and parameter space is the 33-dimensional space of real numbers
(i.e., $\theta \in \Theta = \mathbb{R}^{33}$). The hypothesis class $G(\Theta)$ in this example is therefore the set of all discriminators
having 3 discriminant functions that are at most 10th-order real polynomials of x. The set of all hypothesis
classes in this example is the set of all discriminators having 3 real-valued discriminant functions of the
scalar x, an infinitely larger set of possibilities than $G(\Theta)$.
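The hypothesis class of example 3.1 can be written down directly. The sketch below is illustrative only: the coefficient layout (`theta[i][k]` multiplying `x**k`), the uniform initial draw, and the rule that the class label is the index of the largest discriminant function are assumptions made for the example.

```python
import random

NUM_CLASSES = 3
ORDER = 10  # each discriminant function is a 10th-order polynomial

def make_parameters(rng):
    """Draw one parameterization: 3 sets of 11 real coefficients."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(ORDER + 1)]
            for _ in range(NUM_CLASSES)]

def discriminant(coeffs, x):
    """Evaluate one polynomial discriminant function by Horner's rule."""
    value = 0.0
    for c in reversed(coeffs):  # coeffs[k] multiplies x**k
        value = value * x + c
    return value

def discriminator(theta, x):
    """Return the outputs of all three discriminant functions at x."""
    return [discriminant(coeffs, x) for coeffs in theta]

def classify(theta, x):
    """Assign the class whose discriminant function is largest."""
    outputs = discriminator(theta, x)
    return max(range(NUM_CLASSES), key=lambda i: outputs[i])

theta = make_parameters(random.Random(0))
num_parameters = sum(len(coeffs) for coeffs in theta)
```

Every point of the 33-dimensional parameter space selects one member of the hypothesis class; `num_parameters` recovers the count given in the example.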

In assessing how well the classifier/learning strategy will generalize, we must consider not just one
error rate $P_e(\mathcal{G} \mid \theta)$ corresponding to one learning trial involving a single training sample of size n and a
single initial parameterization $\theta_0$; we must consider the expected error rate over all such trials. From this
perspective and (3.5), the posterior parameterization $\theta$ depends on the classifier's initial parameterization
$\theta_0$, the training sample $S^n$, the classifier's hypothesis class $G(\Theta)$, and the learning strategy A. As a result,
we can express the expected value of the classifier's error rate $P_e(\mathcal{G} \mid \theta)$ over all possible training samples
of size n and all possible initial parameterizations $\theta_0$. We use the notation $E_z[\,\cdot\,]$ to denote the expectation
operator taken over the space on which z is defined, and $E_{z, \ldots, \zeta}[\,\cdot\,]$ to denote the expectation operator
taken over the joint space on which $z, \ldots, \zeta$ are defined. The expected value of the classifier's error rate
raised to the rth power is therefore

$$ E_{S^n, \theta_0}\big[(P_e(\mathcal{G} \mid \theta))^r\big] = \int\!\!\int \Big( \underbrace{P_e\big(\mathcal{G} \mid \underbrace{\theta = A(\theta_0 \mid S^n, G(\Theta))}_{(3.5)}\big)}_{(3.2)} \Big)^r \rho_{\theta_0}(\theta_0) \, d\theta_0 \; \rho_{S^n}(S^n) \, dS^n \tag{3.8} $$

if the training examples are not iid, or

$$ E_{S^n, \theta_0}\big[(P_e(\mathcal{G} \mid \theta))^r\big] = \int \cdots \int \Big( \underbrace{P_e\big(\mathcal{G} \mid \underbrace{\theta = A(\theta_0 \mid S^n, G(\Theta))}_{(3.5)}\big)}_{(3.2)} \Big)^r \rho_{\theta_0}(\theta_0) \, d\theta_0 \; \rho_{x,W}(X^1, W^1) \, d(X^1, W^1) \cdots \rho_{x,W}(X^n, W^n) \, d(X^n, W^n) \tag{3.9} $$

if the training examples are iid.

^6 We use the notation $g_i(X \mid \theta)$ for the ith discriminant function of the general feature vector X throughout
this text. It is somewhat misleading because it implies that each discriminant function makes use of all the
classifier's parameters. Although this may be true, it is not necessarily so; in the case of the present example,
none of the discriminant functions shares a parameter with any other discriminant function. In other cases
(e.g., multi-layer perceptron classifiers) different discriminant functions do share common parameters. We leave
these details implicit in the interest of simplified notation.

3.2.2  Discriminant  Error  and  the  Efficient  Classifier 

Armed with the expressions in (3.8)-(3.9), we can assess the classifier's discriminant error: the degree
to which its error rate exceeds the minimum Bayes error rate. We can characterize the expectation of
this discriminant error over all learning trials in terms of the traditional notions of an estimator's bias and
variance. These metrics allow us to assess how well and how consistently the classifier approximates the
Bayes-optimal classifier.

Definition 3.6 Discriminant error: We define the discriminant error as the difference between the
classifier's error rate and the Bayes error rate:

$$ DError[\mathcal{G} \mid \theta] = P_e(\mathcal{G} \mid \theta) - P_e(\mathcal{F}_{Bayes}) \tag{3.10} $$

Remark: Since the Bayes error rate is the minimum achievable, the discriminant error is always non-negative:

$$ 0 \le DError[\mathcal{G} \mid \theta] \le 1 - P_e(\mathcal{F}_{Bayes}) \le 1 \tag{3.11} $$

Definition 3.7 Discriminant bias: We define the discriminant bias as the expected value of the
classifier's discriminant error, using the notation $DBias[\mathcal{G} \mid n, G(\Theta), A]$ to signify that discriminant bias
(like the expressions for discriminant variance and mean-squared discriminant error that follow) ultimately
depends on the training sample size n, the hypothesis class $G(\Theta)$, and the learning strategy^7 A. This
dependence is made clear by (3.5)-(3.9):

$$ DBias[\mathcal{G} \mid n, G(\Theta), A] = E_{S^n, \theta_0}\big[DError[\mathcal{G} \mid \theta]\big] = E_{S^n, \theta_0}\big[P_e(\mathcal{G} \mid \theta)\big] - P_e(\mathcal{F}_{Bayes}) \tag{3.12} $$

Remark: Since, by (3.11), the discriminant error is always non-negative, the expectation of this error (i.e.,
the discriminant bias) is always non-negative:

$$ 0 \le DBias[\mathcal{G} \mid n, G(\Theta), A] \le 1 - P_e(\mathcal{F}_{Bayes}) \le 1 \tag{3.13} $$


^7 The dependence on the probabilistic nature of X is left implicit in order to simplify notation.

Definition 3.8 Discriminant variance: We define the discriminant variance as the second central
moment of the classifier's error rate:

$$ DVar[\mathcal{G} \mid n, G(\Theta), A] = E_{S^n, \theta_0}\Big[\big(P_e(\mathcal{G} \mid \theta) - E_{S^n, \theta_0}[P_e(\mathcal{G} \mid \theta)]\big)^2\Big] = E_{S^n, \theta_0}\big[(P_e(\mathcal{G} \mid \theta))^2\big] - \big(E_{S^n, \theta_0}[P_e(\mathcal{G} \mid \theta)]\big)^2 \tag{3.14} $$

Definition 3.9 Mean-squared discriminant error (MSDE): We define the mean-squared discriminant
error (MSDE) as the expected value of the squared discriminant error:^8

$$ MSDE[\mathcal{G} \mid n, G(\Theta), A] = E_{S^n, \theta_0}\big[(DError[\mathcal{G} \mid \theta])^2\big] = E_{S^n, \theta_0}\big[(P_e(\mathcal{G} \mid \theta) - P_e(\mathcal{F}_{Bayes}))^2\big] = (DBias[\mathcal{G} \mid n, G(\Theta), A])^2 + DVar[\mathcal{G} \mid n, G(\Theta), A] \tag{3.15} $$

Remark: We view the mean-squared discriminant error as a measure of the classifier's ability to generalize
well: the lower the MSDE, the better the classifier generalizes. The quantity $(DBias[\mathcal{G} \mid n, G(\Theta), A])^2$
measures how well, on average, the classifier discriminates in comparison to the Bayes-optimal classifier;
$DVar[\mathcal{G} \mid n, G(\Theta), A]$ measures how consistently the classifier discriminates over multiple independent
learning trials.
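The decomposition in (3.15) can be checked on a finite set of learning trials by replacing each expectation with a sample average. The sketch below assumes that an oracle has already supplied the true error rate of each trial and the Bayes error rate; the numbers are invented for illustration and the function name is hypothetical.

```python
def discriminant_stats(error_rates, bayes_error):
    """Sample-average versions of DBias (3.12), DVar (3.14), and MSDE (3.15)."""
    n = len(error_rates)
    mean_error = sum(error_rates) / n
    dbias = mean_error - bayes_error
    dvar = sum((e - mean_error) ** 2 for e in error_rates) / n
    msde = sum((e - bayes_error) ** 2 for e in error_rates) / n
    return dbias, dvar, msde

# Invented error rates from five hypothetical learning trials.
rates = [0.20, 0.24, 0.22, 0.26, 0.18]
dbias, dvar, msde = discriminant_stats(rates, bayes_error=0.16)
```

For these numbers the squared bias and the variance sum to the MSDE computed directly from the squared discriminant errors, up to floating-point rounding.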

^8 Note that MSDE, the mean-squared difference between the classifier's error rate and the Bayes error rate, is
not the same thing as MSE, the mean-squared functional error (described in chapter 2 and section 3.4) between
the classifier's discriminant functions and the strictly probabilistic form of the BDF.

Definition 3.10 The asymptotically unbiased classifier: The classifier is an asymptotically unbiased
estimator of the Bayes-optimal classifier if its discriminant bias is zero for asymptotically large training
sample sizes:

$$ \lim_{n \to \infty} DBias[\mathcal{G} \mid n, G(\Theta), A] = 0 \tag{3.16} $$

Definition 3.11 The consistent classifier: The classifier is a consistent estimator of the Bayes-optimal
classifier if its error rate $P_e(\mathcal{G} \mid \theta)$ converges in probability to the Bayes error rate plus some non-negative
constant $\beta$ as the training sample size grows large:

$$ \lim_{n \to \infty} P\Big( \big| \underbrace{P_e(\mathcal{G} \mid \theta) - P_e(\mathcal{F}_{Bayes})}_{DError[\mathcal{G} \mid \theta]} - \beta \big| < \epsilon \Big) = 1; \qquad \epsilon > 0 \tag{3.17} $$

If $P_e(\mathcal{G} \mid \theta)$ converges in mean-square to the Bayes error rate plus some non-negative constant $\beta$,

$$ \lim_{n \to \infty} E_{S^n, \theta_0}\Big[ \big( \underbrace{P_e(\mathcal{G} \mid \theta) - P_e(\mathcal{F}_{Bayes})}_{DError[\mathcal{G} \mid \theta]} - \beta \big)^2 \Big] = 0, \tag{3.18} $$

(3.17) is satisfied,^9 and the classifier is consistent. Note that (3.18) holds if

$$ \lim_{n \to \infty} DError\big[\mathcal{G} \mid \theta = A(\theta_0 \mid S^n, G(\Theta))\big] = \beta \qquad \forall \{S^n, \theta_0\} \tag{3.19} $$


Definition 3.12 The efficient classifier:

Let  ^  denote  the  set  of  all  possible  learning  strategies  and  recall  that  ^  denotes  the  set  of  all  hypothesis 
classes.  The  classifier  ViX)  =  r(^*(X|fl*  =  A.(®o|'5",  G{©)*)))  generated  from  the  hypothesis 
class  G(©)*  €  ^  by  the  learning  strategy  A.  is  efficient  for  a  given  training  sample  size  n  if  and 
only  if,  given  a  feature  vector  X  with  specific  class-conditional  pdfi  |/7j|yy(X  |Ct7|  ) . 

and  class  prior  probabilities  {Pvv(t»^i) . Pw{^c)}  .there  exists  no  other  classifier  in  ^  that  exhibits 

lower  MSDE: 


P*(X)  =  r(e*(X|»*  =  A.(^o|«S«.G{©)*)))  isefTicient# 

MSDE[e*|«,G{©)*.A.]  <  MSDE[a|«.G(©).A]  (3.20) 

VG(©)  €  ^.  VA  €  II;  Q'{\\9-)  €  G{©)*.a{X|e)  €  G(©) 


Remark: The efficient classifier generalizes best because it exhibits the minimum MSDE. By this
definition, the efficient classifier always exists, since there is always some classifier that exhibits lower
MSDE than any other classifier. Readers familiar with the classical definition of the efficient estimator will
note that our definition, like that of [39, ch. 5], is less restrictive than that of [107] [22, ch. 32].^10

^9 Convergence in mean-square guarantees convergence in probability; see, for example, [45, pp. 148-149].

Figure 3.1: The discriminant bias and discriminant variance of three different classifier paradigms, as
determined by an oracle over an infinite number of independent learning trials. The training sample size n
is the same finite number for each trial, so each classifier's error rate varies across trials. Left: this classifier
has high discriminant bias, so on average its error rate is significantly higher than the Bayes error rate
$P_e(\mathcal{F}_{Bayes})$. Additionally, its high discriminant variance indicates that its error rate fluctuates substantially
across independent trials. As a result, its mean-squared discriminant error (MSDE) is high. Middle: this
classifier has high discriminant bias, so on average its error rate is significantly higher than the Bayes error
rate. However, its low discriminant variance indicates that it is a more consistent classifier than the one on
the left; as a result, its MSDE is lower and it is preferable to the classifier on the left. Right: this classifier has
low discriminant bias and low discriminant variance. As a result it yields a consistently good approximation
to the Bayes error rate. Its MSDE is therefore low.

^10 The classical definition of an efficient estimator requires that it be unbiased and that its variance match
the Cramer-Rao bound; by this more rigorous definition, the efficient estimator does not always exist. While the
Cramer-Rao bound is clearly defined for the parameter estimation context, it is unclear whether there is an
analogous bound in the pattern recognition context. Please refer to section 3.6 for more on this subject.

Example 3.2 For those not familiar with the notion of an efficient estimator, consider the error rates of three
different classifiers that have learned to perform the same pattern recognition task. Each classifier therefore
represents a different estimator of the Bayes-optimal classifier. This is a thought experiment in which we
imagine that the classifiers learn repeatedly over an infinite number of trials. In each trial all three classifiers
learn the same training sample of size n (n is finite), and are subsequently tested by the oracle. The training
sample for each trial is drawn independent of all other training samples. At the end of each trial, the error rate
for each classifier is determined by the oracle and recorded; the results for all trials are compiled. Figure 3.1
summarizes the results for each of the three classifiers. Because the training sample size n is the same
finite number for each trial, each classifier's posterior parameterization (and, as a result, its error rate) varies
from trial to trial. This variance is depicted by the bars of the whisker plots in figure 3.1. Specifically, the
discriminant variance is proportional to the square of the distance between the upper and lower bounds in
each whisker plot. The discriminant bias of each classifier is equal to the distance between the mean value
of its whisker plot (denoted by the dot) and the horizontal line denoting the Bayes error rate $P_e(\mathcal{F}_{Bayes})$.
The classifier on the left is a poor estimator of the Bayes-optimal classifier because it exhibits both high
discriminant bias and high discriminant variance. This means that 1) on average the classifier's error rate
is much greater than the Bayes error rate (high discriminant bias), and 2) the classifier's error rate varies
significantly across trials (high discriminant variance). As a result, the classifier exhibits high MSDE. The
classifier in the middle is a somewhat better estimator of the Bayes-optimal classifier because, although it
exhibits the same high discriminant bias as its counterpart on the left, its error rate is more consistent across
trials. As a result, it exhibits lower MSDE. The classifier on the right is a good estimator of the Bayes-optimal
classifier because it exhibits low discriminant bias and its error rate is consistent across trials. As a result it
exhibits low MSDE. Whether or not this classifier is efficient depends on whether it satisfies the constraints
of definition 3.12.
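The thought experiment of example 3.2 can be approximated with a finite number of trials when the oracle's answer is available in closed form. The sketch below is an illustration under assumptions that are not part of the original text: a hypothetical two-class scalar task (unit-variance Gaussian classes with means -1 and +1, equal priors), a simple plug-in classifier that thresholds at the midpoint of the two class sample means, and 500 trials per training sample size.

```python
import math
import random

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def oracle_error(t):
    """True error rate of the rule 'class 1 if x < t, else class 2'
    when class 1 ~ N(-1, 1), class 2 ~ N(+1, 1), and priors are equal."""
    return 0.5 * (1.0 - phi(t + 1.0)) + 0.5 * phi(t - 1.0)

BAYES_ERROR = oracle_error(0.0)  # the optimal threshold is t = 0

def learn_threshold(rng, n):
    """One learning trial: midpoint of the two class sample means."""
    half = n // 2
    m1 = sum(rng.gauss(-1.0, 1.0) for _ in range(half)) / half
    m2 = sum(rng.gauss(1.0, 1.0) for _ in range(half)) / half
    return 0.5 * (m1 + m2)

def msde_over_trials(n, trials, seed):
    """Return sample-average (DBias, DVar, MSDE) over independent trials."""
    rng = random.Random(seed)
    errors = [oracle_error(learn_threshold(rng, n)) for _ in range(trials)]
    mean_error = sum(errors) / trials
    dbias = mean_error - BAYES_ERROR
    dvar = sum((e - mean_error) ** 2 for e in errors) / trials
    return dbias, dvar, dbias ** 2 + dvar

small = msde_over_trials(n=10, trials=500, seed=1)
large = msde_over_trials(n=1000, trials=500, seed=1)
```

In this toy setting the larger training sample yields both lower discriminant bias and lower discriminant variance, and hence lower MSDE; the sketch says nothing about which learning strategy is efficient in the sense of definition 3.12.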

Measuring the goodness of an estimator by its mean-squared error has a long history in the estimation
theory literature, dating back to Gauss.^11 R. A. Fisher, H. Cramer, and C. R. Rao played central roles
in further defining the “efficient” estimator as one that exhibits minimum mean-squared error (see, for
example, [107, 22, 134, 39]). In fact, it is this body of literature that motivates us to view the classifier's
error rate as an estimator of the Bayes error rate. The preceding definitions of discriminant bias, discriminant
variance, mean-squared discriminant error, and the efficient classifier follow immediately from such a view.
Again, we remind the reader that these definitions are quite different from the definitions of functional bias,
variance, and mean-squared error typically discussed in the connectionist, machine learning, and statistical
pattern recognition literature.


3.2.3  Efficient  Learning 

Despite the similarities between an efficient estimator and an efficient classifier, there are notable differences.
In the classical estimation context, we have a unique parametric model, and the efficient parameter estimator
(if it exists) is uniquely specified. In the pattern recognition context, there are an infinite number of
hypothesis classes with which to approximate the Bayes-optimal classifier of X. Each hypothesis class
constitutes a different parametric model, and some choices will be better than others in terms of the minimum
MSDE they can attain for a given training sample size of the random feature vector. Recognizing this, we
must acknowledge that our choice of hypothesis class $G(\Theta)$ might not contain the efficient classifier of
definition 3.12. In such a case, we would like the classifier that our learning strategy generates to exhibit the
lowest MSDE allowed by the choice of $G(\Theta)$. This implies a notion of relative efficiency (e.g., [91]), which
depends in part on whether or not the hypothesis class constitutes a proper parametric model of the feature
vector X.

^11 C. R. Rao traces the notion of the minimum mean-squared error estimator to a paper that Gauss presented to
the Royal Society of Göttingen in 1809 [108, pg. 123].

Definition 3.13 The proper parametric model $G(\Theta, X)_{proper}$: If the C-class random vector X has
a posteriori class probabilities that are described by the parametric equations

$$ P_{W|x}(\omega_i \mid X) = f_i(X \mid \theta^*), \qquad i = 1, \ldots, C \tag{3.21} $$

and if the hypothesis class $G(\Theta)$ describes the set of all discriminators $\mathcal{G}(X \mid \theta)$ for which

$$ g_i(X \mid \theta) = f_i(X \mid \theta), \qquad i = 1, \ldots, C \tag{3.22} $$

such that

$$ g_i(X \mid \theta) = P_{W|x}(\omega_i \mid X) \quad \forall i \quad \text{if } \theta = \theta^*, \tag{3.23} $$

then $G(\Theta)$ is the proper parametric model of X, which we denote by $G(\Theta, X)_{proper}$. The proper parametric
model of X exists if and only if the a posteriori class probabilities (and the class-conditional pdfs) of X can
be described in parametric form.

Remark: If $G(\Theta, X)_{\text{proper}}$ exists, it is unique, as there is one and only one exact set of parametric expressions
for the a posteriori class probabilities of X. This uniqueness assertion rests on the difference between
functional identity and functional equivalence. The proper parametric model's discriminant functions are, for
the correct choice of parameters $\theta^*$, identical to the a posteriori class probabilities of X, by (3.23). Granted,
Bayes' rule allows the a posteriori class probabilities of X to be expressed in terms of its class-conditional
pdfs (i.e., its likelihood functions) and class prior probabilities via (2.3). This representation is also unique, as
there is one and only one exact set of expressions for the class-conditional pdfs and class prior probabilities
of X. Thus, if $G(\Theta, X)_{\text{proper}}$ exists, it can be expressed in one of two forms: directly, as a "partially
parametric" [91, pg. 255] form equating to the a posteriori class probabilities of X, or indirectly, as a
fully parametric form equating to the products of class-conditional pdfs and class prior probabilities. Many
readers will recognize the fully parametric form of $G(\Theta, X)_{\text{proper}}$ as the foundation of maximum-likelihood
parameter estimation.
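
To make the idea of a proper parametric model concrete, the following Python sketch considers a case that is not taken from this report: two classes whose univariate Gaussian class-conditional pdfs share one variance. Under that assumption the a posteriori probability of class 1 is exactly a logistic function of x, so a logistic hypothesis class contains the true posterior for one parameter setting (the analogue of $\theta^*$). All function names and numerical values are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Univariate Gaussian density."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def true_posterior(x, m1, m2, std, p1):
    """P(class 1 | x) via Bayes' rule from pdfs and priors (fully parametric form)."""
    a = gaussian_pdf(x, m1, std) * p1
    b = gaussian_pdf(x, m2, std) * (1.0 - p1)
    return a / (a + b)


def logistic_discriminant(x, w, b):
    """Hypothetical discriminant function g_1(x | theta), with theta = (w, b)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def proper_parameters(m1, m2, std, p1):
    """Parameter setting at which the logistic form equals the posterior."""
    w = (m1 - m2) / std ** 2
    b = (m2 ** 2 - m1 ** 2) / (2.0 * std ** 2) + math.log(p1 / (1.0 - p1))
    return w, b


def max_abs_gap(m1=1.0, m2=-1.0, std=1.5, p1=0.3):
    """Largest gap between discriminant and posterior over a grid of x values."""
    w, b = proper_parameters(m1, m2, std, p1)
    xs = [k / 10.0 for k in range(-50, 51)]
    return max(
        abs(logistic_discriminant(x, w, b) - true_posterior(x, m1, m2, std, p1))
        for x in xs
    )
```

Calling `max_abs_gap()` returns a value at the level of floating-point round-off, which is the sense in which the discriminant function is identical to the posterior at the correct parameters. If the two Gaussians had unequal variances, the log-odds would be quadratic in x and no choice of (w, b) would reproduce the posterior, so the same logistic class would be an improper parametric model.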

Definition 3.14 An improper parametric model: An hypothesis class that is not the proper parametric
model of X, as defined above, is an improper parametric model of X (i.e., $G(\Theta) \ne G(\Theta, X)_{\text{proper}}$).

Remark: Readers familiar with parametric discrimination (e.g., [29, 40, 91]) will note that proper parametric
models are what White terms "correctly specified" parametric models [140]; improper parametric models
are what White terms "misspecified" parametric models. So-called "non-parametric" models for statistical
pattern recognition are named thus because the models' discriminant functions are not exact expressions of
the feature vector's class-conditional pdfs or a posteriori class probabilities. The nomenclature is unfortunate,
because it incorrectly implies that such models have no parameters. We take the view that all models are
parametric, so all differentiable "non-parametric" models that are trained in supervised fashion are, by our
definition, improper parametric models.

There is at most one proper parametric model (which can be expressed either fully or partially) versus
an infinitely large set of improper parametric models for a given feature vector X. Thus, we are infinitely
more likely to choose an improper parametric model of X absent any information or analysis suggesting the
proper parametric model. Kolmogorov's theorem [77] can be interpreted as proving that without strong a
priori information, $G(\Theta, X)_{\text{proper}}$ (if it exists at all) can be identified only by exhaustive hypothesis
testing of all models in the set of all hypothesis classes $\mathbb{G}$. In practical terms, we might test a few
hypothesis classes to see if any of them constitutes $G(\Theta, X)_{\text{proper}}$ with high likelihood. If one of the
candidate $G(\Theta)$s does, then a specific form of probabilistic learning, tailored to $G(\Theta, X)_{\text{proper}}$, might
very well generate the efficient classifier of definition 3.12 (see section 3.6 and chapter 4). If none of the
candidates proves likely to constitute $G(\Theta, X)_{\text{proper}}$, we will want to achieve the best generalization allowed
by our choice of improper parametric model, whatever that choice ultimately is. This desire raises the issue
of the relatively efficient classifier.

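Definition 3.15 The relatively efficient classifier:
The classifier $\mathcal{D}^*(X) = \Gamma(\mathcal{G}(X \mid \theta^* = \Lambda^*(\theta_0 \mid S^n, G(\Theta))))$ generated from the hypothesis class $G(\Theta)$
by the learning strategy $\Lambda^*$ is relatively efficient for a given training sample size n if and only if, given
a feature vector X with specific class-conditional pdfs $\{\rho_{x|W}(X \mid \omega_1), \ldots, \rho_{x|W}(X \mid \omega_C)\}$ and class
prior probabilities $\{P_W(\omega_1), \ldots, P_W(\omega_C)\}$, there exists no other classifier in $G(\Theta)$ that exhibits lower
MSDE:

$$\mathcal{D}^*(X) = \Gamma(\mathcal{G}(X \mid \theta^* = \Lambda^*(\theta_0 \mid S^n, G(\Theta)))) \ \text{ is relatively efficient iff}$$

$$\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda^*] \le \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] \tag{3.24}$$

$$\forall\, G(\Theta) \in \mathbb{G},\ \forall\, \Lambda; \quad \mathcal{G}(X \mid \theta^*) \in G(\Theta),\ \mathcal{G}(X \mid \theta) \in G(\Theta)$$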


Remark: The relatively efficient classifier $\mathcal{D}^*(X)$ exhibits the lowest MSDE allowed by the choice
of hypothesis class. Whether or not it is the efficient classifier of definition 3.12 depends upon the choice
of hypothesis class $G(\Theta)$. If $G(\Theta) = G(\Theta)^*$, then $\mathcal{D}^*(X)$ is the efficient classifier; otherwise, $\mathcal{D}^*(X)$
is the closest approximation to the efficient classifier (as measured by $\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda^*]$)
allowed by $G(\Theta)$.

Definition 3.16 The efficient learning strategy: The learning strategy $\Lambda^*$ is efficient if and only
if, given a feature vector X with arbitrary class-conditional pdfs $\{\rho_{x|W}(X \mid \omega_1), \ldots, \rho_{x|W}(X \mid \omega_C)\}$
and class prior probabilities $\{P_W(\omega_1), \ldots, P_W(\omega_C)\}$, any training sample size n, and any proper
or improper parametric model $G(\Theta) \in \mathbb{G}$, $\Lambda^*$ always generates the relatively efficient classifier of
definition 3.15.

Remark: The efficient learning strategy always guarantees the lowest MSDE allowed by the training sample
size and the hypothesis class; the guarantee holds for any and all training sample sizes. Frankly, we doubt
that this kind of universally efficient learning strategy exists, for the simple reason that we cannot conceive
of one single learning strategy that can produce the relatively efficient classifier for both improper and proper
parametric models of the feature vector (see section 3.6). Regardless of whether or not the efficient learning
strategy exists, a less stringent form of asymptotically efficient learning certainly does exist.

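Definition 3.17 The asymptotically efficient learning strategy: The learning strategy $\Lambda_{\to *}$ is
asymptotically efficient if and only if, given a feature vector X with arbitrary class-conditional pdfs
$\{\rho_{x|W}(X \mid \omega_1), \ldots, \rho_{x|W}(X \mid \omega_C)\}$ and class prior probabilities $\{P_W(\omega_1), \ldots, P_W(\omega_C)\}$, an
"asymptotically large" training sample size n, and any proper or improper parametric model $G(\Theta) \in \mathbb{G}$,
there exists no other learning strategy that produces a classifier from $G(\Theta)$ with lower MSDE:

$$\Lambda_{\to *} \ \text{ is asymptotically efficient iff}$$

$$\lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_{\to *}] \le \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] \tag{3.25}$$

$$\forall \left(\{\rho_{x|W}(X \mid \omega_1), \ldots, \rho_{x|W}(X \mid \omega_C)\},\ \{P_W(\omega_1), \ldots, P_W(\omega_C)\}\right),\ \forall\, G(\Theta) \in \mathbb{G},$$

$$\forall\, \Lambda; \quad \mathcal{G}(X \mid \theta^{\to *} = \Lambda_{\to *}(\theta_0 \mid S^n, G(\Theta))) \in G(\Theta),\ \mathcal{G}(X \mid \theta) \in G(\Theta)$$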


Remark: There is only one difference between the efficient learning strategy and the asymptotically efficient
learning strategy: efficient learning is guaranteed to generate the relatively efficient classifier for large
and small training sample sizes; asymptotically efficient learning is guaranteed to generate the relatively
efficient classifier for large training sample sizes only.

Characterizing  a  learning  strategy  as  asymptotically  efficient  is  a  strong  statement  for  two  reasons: 

•  First: Asymptotically efficient learning is guaranteed to generate the relatively efficient classifier,
given any hypothesis class, as long as the training sample size is sufficiently large. Although this
might not seem to be a strong statement in absolute terms, it does indicate that asymptotically efficient
learning is preferable to inefficient learning. That is, a learning strategy that always generates the
relatively efficient classifier for large training sample sizes is preferable to any alternative strategy that
usually fails to generate the relatively efficient classifier for any training sample size. The relevance of
this obvious statement to current machine learning strategies for statistical pattern recognition becomes
clear in section 3.4, wherein we prove that probabilistic learning is inefficient.

•  Second: Asymptotically efficient learning generates the relatively efficient classifier for small as well
as large training sample sizes, given most choices of hypothesis class. It fails to generate the relatively
efficient classifier for small n only when the hypothesis class is a good approximation of the proper
parametric model of X (i.e., when $G(\Theta) \approx G(\Theta, X)_{\text{proper}}$; see section 3.6, chapter 4, and
section 8.5).

3.3  Differential  Learning  is  Asymptotically  Efficient 

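Recall that $S^n$ denotes the training sample of size n. By (2.86), the sample average value of the CFM
objective function converges to its expected value over all feature vectors as the training sample size grows
large:

$$\lim_{n \to \infty} \mathrm{CFM}(S^n \mid \theta) = E_x[\mathrm{CFM}(X \mid \theta)] = \int_{\chi} \underbrace{\left[ \sum_{i=1}^{C} \sigma[\delta_i(X \mid \theta), \psi] \cdot P_{W|x}(\omega_i \mid X) \right]}_{\mathrm{CFM}(X \mid \theta)} \rho_x(X)\, dX \tag{3.26}$$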


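Recall from definition 2.11 and (2.26) that the CFM objective function $\sigma[\delta_i(X \mid \theta), \psi]$ becomes a step
function as its confidence parameter $\psi$ goes to zero:

$$\lim_{\psi \to 0^+} \sigma[\delta_i(X \mid \theta), \psi] = \begin{cases} 0, & \delta_i(X \mid \theta) < 0 \\ 1, & \delta_i(X \mid \theta) > 0 \end{cases} \tag{3.27}$$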
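
The limiting behaviour in (3.27) can be illustrated numerically. The Python sketch below does not implement the CFM function of definition 2.11; as an assumption made purely for illustration, it substitutes an ordinary logistic sigmoid of $\delta / \psi$ to show how a sigmoidal function of the discriminant differential approaches a step as its confidence parameter shrinks. The function names are hypothetical.

```python
import math


def sigmoid_stand_in(delta, psi):
    """Logistic stand-in for a sigmoidal objective; NOT the CFM of definition 2.11."""
    t = delta / psi
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # t < 0 here, so exp(t) cannot overflow
    return e / (1.0 + e)


def step_gap(delta, psi):
    """Distance between the stand-in and the step function of (3.27), for delta != 0."""
    step = 1.0 if delta > 0.0 else 0.0
    return abs(sigmoid_stand_in(delta, psi) - step)
```

For a fixed non-zero differential such as `delta = 0.2`, evaluating `step_gap(0.2, psi)` for `psi` = 1.0, 0.1, and 0.01 gives a gap that shrinks towards zero, which is the sense of the limit in (3.27).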


Finally, recall from (2.87)-(2.90) that a positive discriminant differential $\delta_i(X \mid \theta)$ indicates that the
corresponding ith discriminator output is greater than all other outputs ($\delta_i(X \mid \theta) = \delta_{(1)}(X \mid \theta) > 0$
iff $g_i(X \mid \theta) > g_j(X \mid \theta)\ \forall j \ne i$), such that the class label assigned to X is $\Gamma(\mathcal{G}(X \mid \theta)) = \omega_i = \omega_{(1)}$.
Given (3.27), $\mathrm{CFM}(X \mid \theta)$ in (3.26) has only one non-zero term, corresponding to class $\Gamma(\mathcal{G}(X \mid \theta)) = \omega_{(1)}$.
Thus, $\mathrm{CFM}(X \mid \theta)$ converges to the a posteriori probability of class $\omega_{(1)}$ (i.e., the class corresponding to
the discriminator's largest output):

$$\lim_{\psi \to 0^+} \mathrm{CFM}(X \mid \theta) = P_{W|x}(\omega_{(1)} \mid X) = P_{W|x}(\Gamma(\mathcal{G}(X \mid \theta)) \mid X) = \underbrace{1 - P_e(\mathcal{G}(X \mid \theta))}_{(3.1)} \tag{3.28}$$

* See appendix B.

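As a result, the expected value of $\mathrm{CFM}(X \mid \theta)$ converges to one minus the classifier's error rate:

$$\lim_{\psi \to 0^+} E_x[\mathrm{CFM}(X \mid \theta)] = 1 - \underbrace{\int_{\chi} P_e(\mathcal{G}(X \mid \theta))\, \rho_x(X)\, dX}_{(3.2)} \tag{3.29}$$

By (3.2) and (3.10),

$$\lim_{\psi \to 0^+} E_x[\mathrm{CFM}(X \mid \theta)] = 1 - P_e(\mathcal{G} \mid \theta) = \underbrace{1 - P_e(\mathcal{F}_{\text{Bayes}})}_{\text{constant}} - \mathrm{DError}[\mathcal{G} \mid \theta] \tag{3.30}$$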
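
The relationship between the step form of the objective and the error rate can be checked by simulation. The sketch below is an illustration under assumptions that are not part of this report: two equally likely classes with unit-variance Gaussian features centred at +1 and -1, and a hypothetical two-output discriminator that decides by comparing x with a threshold. It averages the step function of the true class's discriminant differential over a random sample and compares the result with one minus the exact error rate.

```python
import math
import random


def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def exact_error_rate(threshold):
    """Error rate of the rule 'class 1 if x > threshold' for the assumed Gaussians."""
    miss_class_1 = phi(threshold - 1.0)        # class 1 (mean +1) falls below threshold
    miss_class_2 = 1.0 - phi(threshold + 1.0)  # class 2 (mean -1) falls above threshold
    return 0.5 * miss_class_1 + 0.5 * miss_class_2


def sample_average_step_objective(threshold, n, seed=0):
    """Sample average of step(delta_true_class) over n labelled examples."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        label = 1 if rng.random() < 0.5 else 2
        x = rng.gauss(1.0 if label == 1 else -1.0, 1.0)
        g1, g2 = x - threshold, threshold - x  # hypothetical discriminator outputs
        delta = (g1 - g2) if label == 1 else (g2 - g1)
        total += 1 if delta > 0.0 else 0
    return total / n
```

With a deliberately suboptimal threshold such as 0.5, `sample_average_step_objective(0.5, 100000)` and `1 - exact_error_rate(0.5)` agree to within sampling error, and both fall short of the value at the Bayes-optimal threshold of 0 by the discriminant error of the suboptimal rule.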

Equations (3.26)-(3.30) prove that the CFM objective function (in its step functional limiting form)
converges to a constant minus the classifier's discriminant error, given an asymptotically large training
sample size. By this result, we state the following theorem:

Theorem 3.1 In the limit that the CFM objective function becomes a step function, the associated differential
learning strategy described in chapter 2 is asymptotically efficient. The asymptotic efficiency of
differential learning is independent of both the probabilistic nature of the feature vector and the hypothesis
class from which the classifier is generated.

Assumption 3.1 We assume that $\Lambda_\Delta$ employs a search algorithm (i.e., a numerical optimization procedure)
that is guaranteed to find the posterior parameters $\theta^\Delta$ that maximize the sample-average value of the
CFM objective function, given X and the hypothesis class $G(\Theta)$, regardless of the discriminator's initial
parameterization $\theta_0$. That is, we assume that the search algorithm will not halt in a local maximum.

Proof: The differential learning strategy $\Lambda_\Delta$ maximizes the CFM objective function for the training
sample $S^n$; the maximum is found with respect to the classifier's parameters, as described by definitions
2.8 and 2.10. We use $\theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))$ to denote the posterior parameterization of the classifier
generated by the differential learning strategy, given the step functional form of CFM when $\psi \to 0^+$, the
training sample $S^n$, the initial parameterization $\theta_0$, and the hypothesis class $G(\Theta)$. By (3.26)-(3.30),
maximizing the step form of CFM is equivalent to maximizing a constant minus the classifier's discriminant
error as the training sample size n grows large. As a result,

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} 1 - P_e(\mathcal{F}_{\text{Bayes}}) - \mathrm{DError}[\mathcal{G} \mid \theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))] \ \ge\ 1 - P_e(\mathcal{F}_{\text{Bayes}}) - \mathrm{DError}[\mathcal{G} \mid \theta] \quad \forall\, \theta \in \Theta \tag{3.31}$$

By (3.11), we know that $\mathrm{DError}[\mathcal{G} \mid \theta]$ is always non-negative, so (3.31) leads to*

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} \left( \mathrm{DError}[\mathcal{G} \mid \theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))] \right)^a = \min_{\theta} \left( \mathrm{DError}[\mathcal{G} \mid \theta] \right)^a; \quad a > 0 \tag{3.32}$$

or

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} \left( \mathrm{DError}[\mathcal{G} \mid \theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))] \right)^a \le \left( \mathrm{DError}[\mathcal{G} \mid \theta] \right)^a \quad \forall\, \theta \in \Theta;\ a > 0 \tag{3.33}$$

* The following equations hold for all training sample/initial parameterization combinations (i.e., $\forall\, \{S^n, \theta_0\}$) owing to the
convergence properties of the training sample (see appendix B) and the assumption of theorem 3.1.

That is, maximizing the step form of CFM is equivalent to minimizing the classifier's discriminant error
for asymptotically large training sample sizes. Clearly, if we minimize the classifier's discriminant error
for each trial involving an independently drawn training sample $S^n$ and an independently drawn initial
parameterization $\theta_0$, then we minimize the expected value of the classifier's discriminant error over all such
trials. By (3.8)-(3.9) and (3.33),

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} E_{S^n, \theta_0}\!\left[ \left( \mathrm{DError}[\mathcal{G} \mid \theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))] \right)^a \right] \le E_{S^n, \theta_0}\!\left[ \left( \mathrm{DError}[\mathcal{G} \mid \theta = \Lambda(\theta_0 \mid S^n, G(\Theta))] \right)^a \right] \quad \forall\, \Lambda;\ a > 0 \tag{3.34}$$

It follows immediately from definitions 3.7 and 3.9 that maximizing the step form of CFM minimizes the
classifier's discriminant bias and MSDE for asymptotically large training sample sizes:

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} \mathrm{DBias}[\mathcal{G} \mid n, G(\Theta), \Lambda_\Delta] \le \mathrm{DBias}[\mathcal{G} \mid n, G(\Theta), \Lambda] \quad \forall\, \Lambda \tag{3.35}$$

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_\Delta] \le \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] \quad \forall\, \Lambda \tag{3.36}$$

Since (3.36) holds regardless of the choice of hypothesis class $G(\Theta)$ or the probabilistic nature of X, it is
equivalent to (3.25) in definition 3.17. Thus, differential learning via the CFM objective function ($\psi \to 0^+$)
is asymptotically efficient. ∎

Remark: The preceding proof is significant because it guarantees that differential learning generates
the relatively efficient classifier of definition 3.15, regardless of $G(\Theta)$, as long as the training
sample size is sufficiently large. At present, we know of no other learning strategy that provides this
guarantee. We emphasize that establishing the asymptotic efficiency of differential learning does not
refute the existence of some other asymptotically efficient learning strategy $\Lambda_o$ with better convergence
properties:* $\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_o] < \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_\Delta]\ \forall\, n < \infty$. This is the problem
with asymptotically efficient estimators in general: we know they do good things with large sample sizes,
but we can't be sure that there isn't another asymptotically efficient estimator that does even better for small
sample sizes.

3.3.1 Differential Learning Generates Consistent Classifiers

When $a = 1$ in (3.32),

$$\lim_{\substack{n \to \infty \\ \psi \to 0^+}} \mathrm{DError}[\mathcal{G} \mid \theta^\Delta = \Lambda_\Delta(\theta_0 \mid S^n, G(\Theta))] = \min_{\theta} \mathrm{DError}[\mathcal{G} \mid \theta], \tag{3.37}$$

which, owing to the convergence properties of the training sample (see appendix B) and the assumption
of theorem 3.1, holds for each and every combination of training sample $S^n$ and initial parameterization
$\theta_0$. As a result, (3.37) is equivalent to (3.19), and differential learning generates consistent classifiers, by
definition 3.11 (page 59).

* We remind the reader that an asymptotically efficient learning strategy must generate the relatively efficient classifier for large
training sample sizes, regardless of the choice of hypothesis class. To be sure, there are probabilistic learning strategies that exhibit
better convergence properties than differential learning does when $G(\Theta) \approx G(\Theta, X)_{\text{proper}}$, but these learning strategies provably fail to
generate the relatively efficient classifier for the arbitrarily chosen $G(\Theta)$. This issue is the subject of sections 3.4 and 3.6.

Equation (3.37) also guarantees that the discriminant variance of the differentially generated classifier
converges to zero for asymptotically large training sample sizes. As mentioned in the preceding section, the
guarantee does not set a bound on the rate of this convergence; it simply states that 1) the convergence takes
place, and 2) beyond some lower-bound value of n, the differentially-generated classifier's MSDE is the
lowest allowed by the choice of hypothesis class.

The value of this lower bound on n is determined by the probabilistic nature of X, over which the
learning machine has no control, as well as the choice of hypothesis class $G(\Theta)$, over which the learning
machine (or the person controlling it) does have control. Generally we have access to training sample
sizes that are orders of magnitude too small to ensure convergence of the training sample statistics to their
underlying class-conditional probability densities and a posteriori class probabilities, so (3.30) fails to hold
as an identity (rather, it holds only as an approximation), and the classifier's ability to generalize (i.e., its
MSDE) is quite sensitive to the complexity of $G(\Theta)$.

In order to prove our claim that differential learning is efficient (not just asymptotically efficient) for most
choices of hypothesis class, we prove its minimum-complexity requirements in section 3.5 and link these to
Vapnik and Chervonenkis's seminal work relating model (i.e., hypothesis class) complexity to generalization
[137], [136, ch. 6].

3.3.2  A  Word  Regarding  “Agnostic”  Learning 

In concluding this section, we remind the reader that our definition of asymptotically efficient learning is
unconditional, in that it places no restriction on the probabilistic nature of the feature vector or the hypothesis
class from which the classifier is generated. Differential learning is an asymptotically efficient agnostic
learning strategy because it generates the relatively efficient classifier for any and all feature vector/hypothesis
class combinations, given a sufficiently large training sample size. The term "efficient agnostic learning"
has been coined recently by Kearns, Schapire, and Sellie [73, 72]. Although our definitions of efficient
learning are more universal than theirs (in that definitions 3.16 and 3.17 pertain to not just one, but almost
all and all (respectively) feature vector/hypothesis class combinations; cf. [73, sec. 2.4]),* it is clear that
the two are motivated by a similar philosophical perspective. Furthermore, we acknowledge and share their
notion of an agnostic learning strategy as one that places the fewest constraints on the form of the classifier's
discriminant functions. Since differential learning, in the limit that the CFM objective function becomes
a step function, requires the least restrictive conditions necessary for Bayesian discrimination (stated in
definition 2.2), we cannot conceive of a more agnostic learning strategy.

* Moreover, our definition of efficient learning has its basis in classical estimation theory, whereas Kearns and Schapire's definition has
its basis in theoretical computer science. By their definition, learning is efficient if it exhibits polynomial time and sample complexity.




3.4  Discriminant  Error  Versus  Functional  Error,  and  the  Inefficiency 
of  Probabilistic  Learning 

Readers familiar with the connectionist and machine learning literature (e.g., [136, 10, 59, 41, 146]) will note
that our definition of discriminant error is very different from the definitions of functional error typically
discussed. Recall that section 2.3 describes the properties of many functional error measures. By the proofs
of that section, all of these error measures are associated with the probabilistic learning strategy, which for
asymptotically large training sample sizes seeks to minimize some measure of functional error between each
discriminator output $g_i(X \mid \theta)$ and its corresponding a posteriori probability $P_{W|x}(\omega_i \mid X)$.

In  the  preceding  section,  taking  the  expectation  of  the  step  form  of  the  CFM  objective  function  over 
all  feature  vectors  showed  that  CFM  constitutes  an  asymptotically  unbiased  estimator  of  one  minus  the 
classifier’s  error  rate.  Thus,  learning  by  maximizing  CFM  proved  to  be  asymptotically  efficient.  We  know  of 
no  error  measure  that  constitutes  an  unbiased  estimator  of  the  arbitrary  differentiable  supervised  classifier’s 
error  rate  for  the  general  C-class  pattern  recognition  task;  as  a  result,  we  know  of  no  probabilistic  learning 
strategy  that  is  asymptotically  efficient  according  to  definition  3.17. 

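Consider the expected value of mean-squared error over all feature vectors, described by (2.70):

$$\lim_{n \to \infty} \mathrm{MSE}(S^n \mid \theta) = E_x[\mathrm{MSE}(X \mid \theta)] \tag{3.38}$$

$$= \sum_{i=1}^{C} \int_{\chi} \left[ (g_i(X \mid \theta) - 1)^2 \cdot P_{W|x}(\omega_i \mid X) + g_i(X \mid \theta)^2 \cdot (1 - P_{W|x}(\omega_i \mid X)) \right] \rho_x(X)\, dX$$

$$= \sum_{i=1}^{C} \int_{\chi} \Big[ \underbrace{(g_i(X \mid \theta) - P_{W|x}(\omega_i \mid X))^2}_{\text{functional error}} - P_{W|x}(\omega_i \mid X)^2 + P_{W|x}(\omega_i \mid X) \Big] \rho_x(X)\, dX$$

$$= \sum_{i=1}^{C} \Big( E_x\big[ \underbrace{(g_i(X \mid \theta) - P_{W|x}(\omega_i \mid X))^2}_{\text{squared functional error}} \big] \underbrace{-\, E_x\big[ P_{W|x}(\omega_i \mid X)^2 \big] + P_W(\omega_i)}_{\text{constant}} \Big)$$

$$\ne \mathrm{DError}[\mathcal{G} \mid \theta] + \text{some constant} \tag{3.39}$$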
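
The regrouping of the integrand in (3.39) is a pointwise algebraic identity. As a brief check, write $g$ for $g_i(X \mid \theta)$ and $P$ for $P_{W|x}(\omega_i \mid X)$ (a shorthand introduced here only for this expansion):

```latex
\begin{align*}
(g - 1)^2 P + g^2 (1 - P)
  &= g^2 P - 2 g P + P + g^2 - g^2 P \\
  &= g^2 - 2 g P + P \\
  &= (g - P)^2 - P^2 + P .
\end{align*}
```

Taking the expectation over X and using $E_x[P_{W|x}(\omega_i \mid X)] = P_W(\omega_i)$ yields the terms $-E_x[P_{W|x}(\omega_i \mid X)^2] + P_W(\omega_i)$, which do not depend on $\theta$; only the squared functional error term is affected by learning.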


Expressions for functional bias and variance can be derived from this mean-squared functional error
expression (see, for example, [41, sections 2 & 3]). Learning probabilistically by minimizing the MSE
objective function ultimately minimizes the squared functional error between the discriminator outputs and
their corresponding a posteriori class probabilities, as shown in (3.39). Since this measure of functional error
is not a measure of discriminant error, minimizing it does not guarantee that the classifier's error rate will be
minimized.* By the same token, functional bias and variance expressions, while very useful in the function
approximation context, are not necessarily relevant to the pattern recognition context. This is because they
bear no direct relationship to discriminant bias and discriminant variance. In chapter 4 we shall see that
classifiers can exhibit very high functional bias and variance, while exhibiting very low discriminant bias
and variance. In chapter 5 we shall see that decreasing a classifier's functional error can have the undesirable
effect of increasing its discriminant error.
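
A minimal numerical sketch of this distinction follows, using hypothetical numbers rather than any result from chapters 4 or 5. It assumes a two-class feature vector whose a posteriori class probabilities are (0.6, 0.4) for every X, and compares two made-up discriminator outputs. Discriminant error is computed as the classifier's error rate minus the Bayes error rate, in keeping with (3.30); the helper names are illustrative only.

```python
def squared_functional_error(outputs, posteriors):
    """Sum over classes of (g_i - P(w_i | X))^2 at a single X."""
    return sum((g - p) ** 2 for g, p in zip(outputs, posteriors))


def error_rate(outputs, posteriors):
    """Probability of error when the class with the largest output is chosen."""
    chosen = max(range(len(outputs)), key=lambda i: outputs[i])
    return 1.0 - posteriors[chosen]


def discriminant_error(outputs, posteriors):
    """Error rate in excess of the Bayes error rate (cf. (3.30))."""
    bayes_error = 1.0 - max(posteriors)
    return error_rate(outputs, posteriors) - bayes_error


POSTERIORS = (0.6, 0.4)       # hypothetical, constant over X
OVERCONFIDENT = (0.95, 0.05)  # poor probability estimates, correct ranking
NEARLY_EXACT = (0.49, 0.51)   # good probability estimates, wrong ranking
```

Here `NEARLY_EXACT` has roughly one tenth of the squared functional error of `OVERCONFIDENT`, yet its discriminant error is 0.2 while that of `OVERCONFIDENT` is zero: only the ranking of the outputs matters for classification.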

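In fact, all error measures we have encountered have the same flaws that mean-squared error has; this is
indicated by the expected value of the general error measure over all feature vectors, described by (2.37):

$$\lim_{n \to \infty} \mathrm{EM}(S^n \mid \theta) = E_x[\mathrm{EM}(X \mid \theta)]$$

$$= \sum_{i=1}^{C} \int_{\chi} \left[ \varepsilon(D - g_i(X \mid \theta)) \cdot P_{W|x}(\omega_i \mid X) + \varepsilon(g_i(X \mid \theta) - \neg D) \cdot (1 - P_{W|x}(\omega_i \mid X)) \right] \rho_x(X)\, dX$$

$$\ne \mathrm{DError}[\mathcal{G} \mid \theta] + \text{some constant} \tag{3.40}$$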


In short, no error measure we know of is monotonically related to discriminant error for the general C-class
pattern recognition task employing the arbitrary differentiable supervised classifier (see section 5.3). Thus,
probabilistic learning cannot be asymptotically efficient by the (universal) definition 3.17. It can, however,
produce the efficient classifier of X for the special case in which $G(\Theta) = G(\Theta, X)_{\text{proper}}$; this is the
subject of section 3.6.


3.5 Differential Learning Requires the Minimum-Complexity Classifier


By choosing a particular hypothesis class $G(\Theta)$ with which to classify X, we reduce our possible choices
for modeling the BDF of X to subsets of the four forms of the BDF defined in section 2.2.2:


* No doubt, some readers will discern a philosophical parallel between our argument against using the MSE objective function for
probabilistic learning and earlier arguments against using MSE as a parameter estimation criterion. These earlier arguments are published
in the estimation theory literature of the past three decades (see, for example, [71, 108, 74]). Of course, our argument is not against the
MSE objective function per se, but against error measure objective functions in general, since they engender inefficient learning. Again,
we return to the assertion that minimizing a classifier's functional error is not the same as minimizing its discriminant error.


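$$\begin{aligned}
G(\Theta)_{\text{Bayes-Strictly Probabilistic}} &= \{\mathcal{G}(X \mid \theta) : \mathcal{G}(X \mid \theta) \in \mathcal{F}_{\text{Bayes-Strictly Probabilistic}}\} \\
G(\Theta)_{\text{Bayes-Probabilistic}} &= \{\mathcal{G}(X \mid \theta) : \mathcal{G}(X \mid \theta) \in \mathcal{F}_{\text{Bayes-Probabilistic}}\} \\
G(\Theta)_{\text{Bayes-Strictly Differential}} &= \{\mathcal{G}(X \mid \theta) : \mathcal{G}(X \mid \theta) \in \mathcal{F}_{\text{Bayes-Strictly Differential}}\} \\
G(\Theta)_{\text{Bayes-Differential}} &= \{\mathcal{G}(X \mid \theta) : \mathcal{G}(X \mid \theta) \in \mathcal{F}_{\text{Bayes-Differential}}\} \\
\text{i.e.,} \quad G(\Theta)_{\text{Bayes}} &\subset \mathcal{F}_{\text{Bayes-Differential}} = \mathcal{F}_{\text{Bayes}}
\end{aligned} \tag{3.41}$$

By the definitions of section 2.2.2, these subsets of the hypothesis class are related as follows:

$$G(\Theta)_{\text{Bayes-Strictly Probabilistic}} \subset G(\Theta)_{\text{Bayes-Strictly Differential}} \subset G(\Theta)_{\text{Bayes-Probabilistic}} \subset G(\Theta)_{\text{Bayes-Differential}} = G(\Theta)_{\text{Bayes}} \subset \mathcal{F}_{\text{Bayes}} \tag{3.42}$$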

All of the subsets of $G(\Theta)$ in (3.41) and (3.42) may be empty, in which case $G(\Theta)$ does not contain
a Bayes-optimal discriminator of X (i.e., $\{\mathcal{G}(X \mid \theta) : \mathcal{G}(X \mid \theta) \in \mathcal{F}_{\text{Bayes}}\} = \emptyset$). Regardless of the
specific nature of $G(\Theta)$ and whether or not it contains a Bayes-optimal discriminator of X, it certainly does
contain a discriminator $\mathcal{G}(X \mid \theta_{\min})$ that exhibits the minimum discriminant error of any discriminator in
$G(\Theta)$:

$$\mathrm{DError}[\mathcal{G} \mid \theta_{\min}] = \min_{\theta} \mathrm{DError}[\mathcal{G} \mid \theta] \tag{3.43}$$

Recall from definitions 3.11 and 3.15 that if the relatively efficient classifier is consistent, its discriminator
will converge to $\mathcal{G}(X \mid \theta_{\min})$ as the training sample size grows asymptotically large. Under this condition,
its asymptotic discriminant variance will be zero and its asymptotic MSDE will be given by

$$\lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] = \lim_{n \to \infty} \left( \mathrm{DBias}[\mathcal{G} \mid n, G(\Theta), \Lambda] \right)^2 \tag{3.44}$$

Since the differentially-generated classifier is both relatively efficient and consistent, its asymptotic
discriminant bias is indeed $\mathrm{DError}[\mathcal{G} \mid \theta_{\min}]$, the minimum allowed by $G(\Theta)$.

If differential learning is guaranteed to produce the least biased approximation to the Bayes-optimal
classifier allowed by $G(\Theta)$ for asymptotically large training sample sizes, then we might consider the set
of all hypothesis classes in $\mathbb{G}$ for which $\mathrm{DError}[\mathcal{G} \mid \theta_{\min}]$ is equal to some specified value $\mathrm{DBias}_{\text{spec}}$:

$$\mathbb{G}_{\text{spec}} = \Big\{ G(\Theta) : G(\Theta) \in \mathbb{G} \ \cap\ \underbrace{\mathrm{DError}[\mathcal{G} \mid \theta_{\min}]}_{(3.43)} = \mathrm{DBias}_{\text{spec}} \Big\}; \quad 0 \le \mathrm{DBias}_{\text{spec}} \le 1 \tag{3.45}$$

From $\mathbb{G}_{\text{spec}}$ we might wish to choose the hypothesis class with the least functional complexity; this choice
would therefore have the least functional complexity necessary to classify X with the $\mathrm{DBias}_{\text{spec}}$ level of
asymptotic discriminant bias. By making this minimum-complexity choice, we would follow the maxim of
Occam's razor: "the simplest model is the best one" [130].
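
The selection rule just described can be sketched in a few lines of Python. The candidate hypothesis classes, their complexity values, and their minimum discriminant errors below are invented for illustration; in practice the minimum discriminant error of a hypothesis class is not known in advance, and the function name is hypothetical.

```python
def minimum_complexity_class(candidates, dbias_spec, tol=1e-9):
    """Return the least complex candidate whose minimum discriminant error
    equals dbias_spec (within tol), or None if no candidate qualifies."""
    eligible = [c for c in candidates if abs(c["min_derror"] - dbias_spec) <= tol]
    if not eligible:
        return None
    return min(eligible, key=lambda c: c["complexity"])


# Hypothetical candidates: name, complexity measure, minimum discriminant error.
CANDIDATES = [
    {"name": "linear", "complexity": 6, "min_derror": 0.08},
    {"name": "quadratic", "complexity": 9, "min_derror": 0.02},
    {"name": "cubic", "complexity": 12, "min_derror": 0.02},
    {"name": "order-10 polynomial", "complexity": 33, "min_derror": 0.02},
]
```

With a specified asymptotic discriminant bias of 0.02, three of the four candidates are eligible and the rule returns the quadratic class, the least complex of them; the extra complexity of the cubic and order-10 classes buys no reduction in asymptotic discriminant bias.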

This,  of  course,  presumes  that  we  have  defined  an  acceptable  measure  of  functional  complexity.  At 
present,  a  universally  accepted  definition  remains  the  subject  of  debate.  For  this  reason,  we  employ  a  general 
non-specific  measure  of  functional  complexity. 


The general functional complexity measure $T[\cdot]$: The general functional complexity measure $T[\cdot]$ is
a well-defined real measure $T[\mathcal{G}(X \mid \theta)] \in \mathbb{R}$ (the larger the measure, the higher the complexity). The
notation $T_{\text{tot}}[G(\Theta)] \in \mathbb{R}$ denotes the upper bound on the complexity of the hypothesis class $G(\Theta)$:

$$T_{\text{tot}}[G(\Theta)] \triangleq \max_{\theta} T[\mathcal{G}(X \mid \theta)] \tag{3.46}$$

Remark:  Three  well-known  complexity  measures  that  satisfy  this  notion  of  the  general  complexity  measure 
are 

•  The Vapnik-Chervonenkis (VC) dimension $\dim_{VC}(\cdot)$ [137, 136].

•  The number of parameters in the hypothesis class (i.e., the dimensionality of parameter space) $\dim(\Theta)$
(e.g., [113, 135]).

•  The effective number of parameters in the hypothesis class [96, 97].

Example 3.3 The hypothesis class $G(\Theta)$ described in example 3.1 (page 56) has the following complexity:

•  If $T[G(\Theta)] \triangleq \sum_{i=1}^{C} \dim_{VC}(g_i(X \mid \theta))$, then the complexity measure is equal to the sum of
the VC dimension $\dim_{VC}(g_i(X \mid \theta))$ for each of the discriminator's C discriminant functions.* For
the real polynomial discriminant functions described by (3.7), the VC dimension of each discriminant
function is one plus the maximum number of real roots in the function's polynomial. Thus,
$T[G(\Theta)] = 3 \cdot (10 + 1) = 33$.

•  If $T[G(\Theta)] = \dim(\Theta)$, then the complexity measure is simply the total number of parameters
in $G(\Theta)$, which in this particular case is equal to the sum of the VC dimension for each discriminant
function: $\dim(\Theta) = \sum_{i=1}^{C} \dim_{VC}(g_i(X \mid \theta)) = 3 \cdot (10 + 1) = 33$.

* See [137], [136, ch. 6], and [100, ch. 2.3] for detailed descriptions of the VC dimension and its computation. To the
uninitiated, we recommend the lovely summary by Abu-Mostafa [1]. Strictly speaking, the VC dimension pertains to a 2-class
pattern recognition task; the more general C-class task is viewed as C 2-class tasks in which each of the C discriminant functions
$\mathcal{G}(X \mid \theta) = \{g_1(X \mid \theta), \ldots, g_C(X \mid \theta)\}$ maps the feature vector to the binary classification $\omega_i$ or $\neg\omega_i$ (read, "not class i").
The VC dimension is computed for each of the C discriminant functions separately; the complexity measure does not sum across
discriminant functions, at least not for the purpose of estimating training sample sizes necessary for good generalization [136, ch. 6].
In this sense, our using the sum of each discriminant function's VC dimension ($\sum_{i=1}^{C} \dim_{VC}(g_i(X \mid \theta))$) as the overall complexity
measure for the hypothesis class is not consistent with the original intent of the measure. We simply use this sum as a convenient way to
express the overall complexity of the hypothesis class with a single number.
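The arithmetic of this example can be reproduced in a few lines. The sketch below assumes, as the factor 3 · (10 + 1) suggests, that the hypothesis class comprises C = 3 polynomial discriminant functions, each with at most 10 real roots and 10 + 1 coefficients; the function and constant names are ours, chosen purely for illustration.

```python
# Illustrative sketch of the two complexity measures compared in the example.
# Assumption (inferred from 3 * (10 + 1) = 33): C = 3 polynomial discriminant
# functions, each with at most 10 real roots and 10 + 1 coefficients.

NUM_CLASSES = 3
MAX_REAL_ROOTS = 10


def vc_dimension_polynomial(max_real_roots):
    """VC dimension of one polynomial discriminant: one plus its maximum number of real roots."""
    return max_real_roots + 1


def summed_vc_complexity(num_classes, max_real_roots):
    """Sum of the per-discriminant-function VC dimensions."""
    return sum(vc_dimension_polynomial(max_real_roots) for _ in range(num_classes))


def parameter_count_complexity(num_classes, max_real_roots):
    """Total number of coefficients, assuming max_real_roots + 1 coefficients per polynomial."""
    return num_classes * (max_real_roots + 1)


if __name__ == "__main__":
    print(summed_vc_complexity(NUM_CLASSES, MAX_REAL_ROOTS))
    print(parameter_count_complexity(NUM_CLASSES, MAX_REAL_ROOTS))
```

Under these assumptions the two measures coincide, which is the point the example makes for this particular hypothesis class.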

Given one's preferred measure of functional complexity, there is some minimum-complexity hypothesis
class $G(\Theta_{\downarrow})$, among those meeting the specification, that contains a discriminator with minimum discriminant error
$\min_{\theta \in \Theta_{\downarrow}} \mathrm{DError}[\mathcal{G} \mid \theta] = \mathrm{DBias}_{spec}$. Since the differentially-generated classifier, given the hypothesis
class $G(\Theta_{\downarrow})$, exhibits $\min_{\theta \in \Theta_{\downarrow}} \mathrm{DError}[\mathcal{G} \mid \theta]$ for asymptotically large training sample sizes, the
following corollary to theorem 3.1 holds:

Corollary 3.1 Differential learning requires the hypothesis class with the least functional complexity necessary
to approximate the Bayes-optimal classifier with specified precision. The precision of the approximation
is measured in terms of asymptotic discriminant bias (i.e., the discriminant bias of definition 3.7, given an
asymptotically large training sample size $n \to \infty$).

Proof: The proof follows from theorem 3.1 (page 67) and the preceding set-theoretic argument. ■

Remark: The corollary simply states that differential learning requires the least complex model necessary
for the data. This is generally not the case for probabilistic learning, owing to the relationships of (3.42). As
we shall see empirically in chapter 4 and theoretically in chapter 6, it generally requires substantially more
functional complexity to ensure that $G(\Theta)_{\text{Bayes-StrictlyProbabilistic}}$ and $G(\Theta)_{\text{Bayes-Probabilistic}}$ are not empty sets
than it does to ensure that $G(\Theta)_{\text{Bayes-Differential}}$ is not an empty set.

Corollary 3.1 says nothing about differential learning's role in determining the minimum-complexity
hypothesis class $G(\Theta_{\downarrow})$ sufficient for Bayesian discrimination. By Kolmogorov's theorem [77] there is
no  algorithm  short  of  exhaustive  search  for  determining  it  a  priori.  Instead  we  must  restrict  ourselves  to 
a  particular  a  priori  choice  of  hypothesis  class  G(0)  —  an  educated  guess,  which  we  believe  contains  a 
Bayes-optimal  classifier  of  X.  In  making  this  choice  of  G(0) ,  we  must  weigh  its  discriminant  bias  for 
large  sample  sizes  against  its  discriminant  variance  for  small  sample  sizes  —  the  ubiquitous  bias-variance 
tradeoff  that  every  estimator  exhibits.  A  high-complexity  hypothesis  class  will  ensure  low  asymptotic 
discriminant  bias  at  the  cost  of  high  discriminant  variance  for  small  training  sample  sizes.  A  low-complexity 
hypothesis  class  will  ensure  low  discriminant  variance  for  small  training  sample  sizes,  but  will  it  exhibit 
low discriminant bias (i.e., a low error rate)? Differential learning guarantees that it will exhibit the lowest
possible error rate for large training sample sizes (theorem 3.1); this in turn guarantees that we can attain
a specific error rate (greater than or equal to the Bayes error rate) with the least complex hypothesis class
necessary (corollary 3.1). Since discriminant variance increases with the complexity of the hypothesis class
under  VC  analysis,  theorem  3.1  and  corollary  3.1  assure  us  that  we  can  achieve  both  low  discriminant  bias 
and  low  discriminant  variance  by  pairing  a  low-complexity  hypothesis  class  with  differential  learning. 

Because the VC dimension is a complexity measure that satisfies the general characteristics described
above, corollary 3.1 asserts that differential learning requires the hypothesis class $G(\Theta_{\downarrow})$ with the smallest
VC dimension necessary to approximate the Bayes-optimal classifier with a specified level of precision.
Consequently, differential learning allows us, under VC analysis [137] [136, ch. 6], to minimize the
probability that the classifier's worst-case failure to generalize (i.e., the worst-case deviation of the classifier's
empirical training sample error rate from its true error rate) will exceed an unacceptable level. This level is
specified by $\epsilon$ in [137, Theorem 2]. It is worth noting that this minimal worst-case bound does not stem
from the efficiency of differential learning; it stems solely from the minimum complexity requirements of
differential learning. Thus, there are two ostensibly independent mechanisms by which the differentially
generated classifier generalizes well: the efficiency of the learning strategy itself and its minimum-complexity
requirements.
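To make the dependence of the worst-case bound on complexity concrete, the sketch below evaluates one commonly quoted form of the Vapnik-Chervonenkis uniform-convergence bound, $4\,m(2n)\exp(-\epsilon^2 n / 8)$, with the growth function bounded by $m(2n) \le (2n)^d + 1$ for VC dimension d (a consequence of Sauer's lemma). We assume this is the form intended by the cited theorem; the function name `vc_worst_case_bound` and the sample values of n, d, and $\epsilon$ are ours.

```python
import math


def vc_worst_case_bound(n, vc_dim, eps):
    """Bound on P(sup |training error rate - true error rate| > eps), capped at 1.

    Assumed form: 4 * m(2n) * exp(-eps**2 * n / 8), with m(2n) <= (2n)**vc_dim + 1.
    """
    # Exact integer arithmetic for the growth-function bound, then work in logs.
    log_growth = math.log((2 * n) ** vc_dim + 1)
    log_bound = math.log(4.0) + log_growth - (eps ** 2) * n / 8.0
    # A probability bound above 1 is vacuous, so cap it.
    return 1.0 if log_bound >= 0.0 else math.exp(log_bound)


if __name__ == "__main__":
    n, eps = 100_000, 0.1
    for vc_dim in (3, 5, 33):
        print(vc_dim, vc_worst_case_bound(n, vc_dim, eps))
```

For a fixed training sample size and tolerance, the bound grows monotonically with the VC dimension, which is why a learning strategy that tolerates a lower-complexity hypothesis class also enjoys a tighter worst-case guarantee.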

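The only question that remains is whether there is ever a learning strategy with better convergence
properties than differential learning ($\Lambda_{\Delta}$), given a particular low-complexity hypothesis class. That is,
under what conditions might there be a specific hypothesis class $G(\Theta)$, learning strategy $\Lambda$, and feature
vector X combination for which

$$\begin{aligned}
\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] &< \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_{\Delta}], \quad n \ll \infty \\
&\text{and} \\
\lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda] &= \lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda_{\Delta}]
\end{aligned} \tag{3.47}$$

We know of only one such case: if $G(\Theta)$ is a proper parametric model of X and $\Lambda$ is a maximum-likelihood
probabilistic learning strategy, (3.47) can hold.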

3.6  The  Case  for  Probabilistic  Learning 

At  the  end  of  chapter  2  the  reader  might  have  wondered  why  differential  learning  would  be  desirable.  By  this 
point, the reader might wonder just the opposite: why use probabilistic learning if it is inefficient? Although
we  argue  in  favor  of  differential  learning  in  most  cases,  we  believe  there  are  at  least  three  scenarios  under 
which  probabilistic  learning  can  be  preferable: 

•  Probabilistic  learning  is  preferable  when  one  specifically  wants  to  estimate  the  a  posteriori  class 
probabilities of X. By using probabilistic learning, we are obliged to choose a hypothesis class with
sufficient functional complexity to approximate the a posteriori class probabilities of X with high
precision.  This,  in  turn,  generally  dictates  very  large  training  sample  sizes  for  robust  probabilistic 
estimates  (see  chapter  6). 

•  Probabilistic  learning  might  be  desirable  when  one  class  is  always  more  likely  than  any  other,  regardless 
of  X.  As  an  example,  there  might  be  no  combination  of  clinical  factors  for  which  the  probability  of 
death  exceeds  the  probability  of  surviving  coronary  bypass  surgery,  yet  there  will  be  combinations  of 
clinical  factors  for  which  the  probability  of  death  is  relatively  high.  A  physician  counselling  bypass 
surgery  candidates  might  therefore  want  a  robust  probability-of-mortality  estimate  for  each  patient. 
Again,  if  probabilities  must  be  estimated,  we  are  obliged  to  satisfy  the  complexity  and  training  sample 
size  requirements  of  probabilistic  learning. 

•  Probabilistic  learning  might  be  preferable  if  the  hypothesis  class  G(0)  is  a  proper  parametric  model 
of  X  (see  definition  3.13). 

The  first  two  scenarios  involve  a  subjective  choice;  the  third  scenario  does  not.  Rather  it  stems  from  the 
existence  of  the  proper  parametric  model. 

The proper parametric model $G(\Theta, X)_{proper}$ employing a consistent form of probabilistic learning clearly
satisfies the limiting condition of (3.47), namely that its asymptotic MSDE when generated probabilistically is
equal to its asymptotic MSDE when generated differentially:

$$\lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P}] = \lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{\Delta}] \tag{3.48}$$

As we mentioned earlier in this chapter, a rigorous proof that (3.47) holds when $G(\Theta) = G(\Theta, X)_{proper}$ is
beyond both our interest and stamina. Instead, we merely hypothesize why (3.47) holds, sketch a proof, and
describe one particular case (subsequently illustrated in chapter 4) for which the proof holds.

Hypothesis 3.1 If

1. the proper parametric model $G(\Theta, X)_{proper}$ of X exists, and

2. maximum-likelihood estimators of its parameters exist (which we obtain via the maximum-likelihood
probabilistic learning strategy $\Lambda_{P\text{-}ML}$), and

3. the variance of each maximum-likelihood parameter estimator matches the Cramér-Rao bound [107]
[22, ch's. 32-33],

then the classifier $\Gamma(\mathcal{G}(X \mid \theta_{ML}))$ generated from $G(\Theta, X)_{proper}$ by the maximum-likelihood probabilistic
learning procedure $\Lambda_{P\text{-}ML}$ is the relatively efficient classifier of X (definition 3.15) for all training sample
sizes (i.e., $\forall n$). If, in addition, $G(\Theta, X)_{proper}$ is the minimum-complexity hypothesis class containing a
Bayes-optimal classifier of X, then $\Gamma(\mathcal{G}(X \mid \theta_{ML}))$ constitutes the efficient classifier of X (definition 3.12)
for all training sample sizes.

Sketch of Proof: By [107] [22, ch's. 32-33], the maximum-likelihood estimate $\theta_{ML}$ of the parameter
vector $\theta^{*}$ in (3.21) exhibits the lowest possible variance of any estimator of $\theta^{*}$ for any training sample
size n. Thus $\theta_{ML}$ converges to $\theta^{*}$ with probability one faster than any other estimator of $\theta^{*}$. This fastest
convergence implies that $\mathrm{DBias}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P\text{-}ML}]$ converges to zero with probability one faster
than it does for any other estimator of the Bayes-optimal classifier obtained from $G(\Theta, X)_{proper}$. By the
definitions of section 3.2, the maximum-likelihood probabilistic learning strategy $\Lambda_{P\text{-}ML}$ therefore generates
the relatively efficient classifier of X from $G(\Theta, X)_{proper}$ for all training sample sizes:

$$\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P\text{-}ML}] \le \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda] \quad \text{for all } n \text{ and all learning strategies } \Lambda \tag{3.49}$$

If, in addition, $T[G(\Theta, X)_{proper}]$ is the lowest complexity of any hypothesis class containing a Bayes-optimal
classifier of X, then $\Lambda_{P\text{-}ML}$ generates the efficient classifier of X from $G(\Theta, X)_{proper}$ for all
training sample sizes:

$$\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P\text{-}ML}] \le \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta)', \Lambda] \quad \text{for all } n, \text{ all learning strategies } \Lambda, \text{ and all hypothesis classes } G(\Theta)' \tag{3.50}$$

□


Remark: Hypothesis 3.1 is not a theorem, and the preceding argument is not a proof, owing to the lack of
rigor on two points:

•  First, it is not clear (although it may seem intuitively sensible) that fastest convergence in the
discriminator's parameters necessarily guarantees fastest convergence in the classifier's error rate.
It is possible to make this linkage for specific proper parametric models (see section 3.6.1), but it
remains unclear whether there exists a proof of it for the general proper parametric model satisfying
the requirements of hypothesis 3.1.

•  Second, it is easy to conceive of a feature vector X for which the proper parametric model exists, but
for which the proper parametric model is not the minimum-complexity hypothesis class containing a
Bayes-optimal discriminator of X. In such a case, it might be possible to prove that the classifier
generated differentially from the minimum-complexity improper parametric model containing a
Bayes-optimal discriminator of X is more efficient than the one generated from the proper parametric model
(i.e., that the improper parametric model's MSDE converges faster than that of the more complex
proper parametric model).

Example 3.4 The normal-based linear discriminant (i.e., Gaussian maximum-likelihood) and logistic regression
models are probably the best-known examples of a hypothesis class/learning strategy combination
(fully-parametric and partially-parametric, respectively) that generates the efficient classifier of X when
X has homoscedastic Gaussian class-conditional pdfs with the additional nice properties described in
appendix F.* Readers familiar with both traditional logistic regression and neural network models will
recognize that the C-output multi-layer perceptron with logistic nonlinearities and no hidden layer units is
the C-class logistic regression model; this partially-parametric model and its associated fully-parametric
normal-based linear discriminant model (e.g., [91, ch. 3]) are described in section 4.2 and appendix F.
Maximum-likelihood probabilistic learning for the partially-parametric model takes the form of minimizing
a Kullback-Leibler information distance expression; for the fully-parametric model it takes the form of
minimizing a mean-squared error expression (see appendix F). The fully-parametric variant is somewhat
more efficient than the partially-parametric variant [30] and constitutes the efficient classifier of X when X
is the Gaussian feature vector described above (see section 4.2).


3.6.1  Assessing the Asymptotic Relative Efficiency (ARE) of a Non-Differential
Learning Strategy


Equation (3.47) and hypothesis 3.1 acknowledge the possibility that differential learning ($\Lambda_{\Delta}$) is not the
only asymptotically efficient learning strategy for hypothesis classes that are proper parametric models of the
feature vector (or close approximations thereto).

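Definition 3.18 Asymptotic relative efficiency: Given the hypothesis class $G(\Theta)$ and two learning
strategies $\Lambda$ and $\Lambda'$, we define the asymptotic relative efficiency (ARE) of $\Lambda'$ with respect to $\Lambda$ as the
ratio of the MSDE expressions

$$\mathrm{ARE}_{n \to \infty}[\Lambda', \Lambda \mid G(\Theta)] = \frac{\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda']}{\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta), \Lambda]} \tag{3.51}$$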


Remark: This ratio is a generalization of the efficiency ratios for 1) the general estimator [22, sec. 32.3],
and 2) fully-parametric and partially-parametric classifiers of the Gaussian feature vector [30] (see sections
4.2 and F.3). The definition focuses on classifiers that differ only in the learning strategy employed; a
simple generalization of (3.51) allows a comparison of classifiers that differ in terms of their hypothesis
classes as well. If the ARE can be expressed in closed form for the proper parametric model $G(\Theta, X)_{proper}$,
the differential learning strategy $\Lambda_{\Delta}$, and the maximum-likelihood learning strategy $\Lambda_{P\text{-}ML}$, it tells us
which strategy learns the Bayes-optimal classifier faster (i.e., with fewer training examples). The answer
lies in the ratio of MSDE for the two learning strategies, expressed in terms of n as n grows large.
If $\mathrm{ARE}_{n \to \infty}[\Lambda_{P\text{-}ML}, \Lambda_{\Delta} \mid G(\Theta, X)_{proper}] < 1$, the probabilistically-generated (i.e., maximum-likelihood)
proper parametric model is more efficient than its differentially-generated counterpart for finite training
sample sizes, and hypothesis 3.1 is substantiated for the given proper parametric model.

* Homoscedastic pdfs all have the same covariance matrix. Under a simple linear transformation, a feature vector with homoscedastic
Gaussian class-conditional pdfs has class-conditional covariance matrices that are all of the form $\sigma^2 \cdot I$, where I denotes the identity
matrix.

The notions of definition 3.18 and (3.51) require some explanation. By theorem 3.1 it seems that
$\mathrm{ARE}_{n \to \infty}[\Lambda_{P\text{-}ML}, \Lambda_{\Delta} \mid G(\Theta, X)_{proper}]$ should never be less than unity, since we have proven that differential
learning generates the relatively efficient classifier for asymptotically large training sample sizes (recall
(3.36)). The existence of cases in which $\mathrm{ARE}_{n \to \infty}[\Lambda_{P\text{-}ML}, \Lambda_{\Delta} \mid G(\Theta, X)_{proper}] < 1$ would seem to refute
theorem 3.1 by contradiction of (3.36), but it does not. The explanation lies in the difference between a limit
and the rate at which an expression converges to that limit; it is best expressed by a simple example.

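Example 3.5 Let us assume that there is a feature vector X for which the proper parametric model
$G(\Theta, X)_{proper}$ exists. Furthermore, let us assume that the following MSDE expressions hold:

$$\begin{aligned}
\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{\Delta}] &= \text{[right-hand side illegible in source]} \\
\mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P\text{-}ML}] &= \text{[a power of } n \text{; exponent illegible in source]}
\end{aligned} \tag{3.52}$$

Equation (3.52) guarantees that the asymptotic MSDE exhibited by both learning strategies is zero,

$$\lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{P\text{-}ML}] = \lim_{n \to \infty} \mathrm{MSDE}[\mathcal{G} \mid n, G(\Theta, X)_{proper}, \Lambda_{\Delta}] = 0, \tag{3.53}$$

even though the ARE of the maximum-likelihood learning strategy $\Lambda_{P\text{-}ML}$ is much less than unity:

$$\mathrm{ARE}_{n \to \infty}[\Lambda_{P\text{-}ML}, \Lambda_{\Delta} \mid G(\Theta, X)_{proper}] = \text{[value illegible in source]} \tag{3.54}$$

That is, both learning strategies generate the Bayes-optimal classifier of X, but the probabilistically-generated
classifier converges to the Bayes-optimal classifier faster than its differentially-generated counterpart, by an
order $O[\cdot]$ in n whose exponent is illegible in the source.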
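The distinction between a limit and the rate of convergence to that limit can be checked numerically. The sketch below uses hypothetical MSDE rates chosen by us purely for illustration (1/n for the differentially-generated classifier and 1/n² for the maximum-likelihood one); they are assumptions, not the expressions of (3.52), and the function names are ours.

```python
# Hypothetical MSDE rates (illustrative assumptions, not taken from the source).

def msde_differential(n):
    """Assumed MSDE of the differentially-generated classifier: 1 / n."""
    return 1.0 / n


def msde_max_likelihood(n):
    """Assumed MSDE of the maximum-likelihood classifier: 1 / n**2."""
    return 1.0 / n ** 2


def relative_efficiency(msde_prime, msde_reference, n):
    """Ratio of MSDE expressions, in the spirit of definition 3.18, at sample size n."""
    return msde_prime(n) / msde_reference(n)


if __name__ == "__main__":
    for n in (10, 100, 1000, 10000):
        ratio = relative_efficiency(msde_max_likelihood, msde_differential, n)
        print(n, msde_max_likelihood(n), msde_differential(n), ratio)
```

Both assumed MSDE expressions tend to zero, so the two strategies agree in the limit, yet their ratio stays below unity and shrinks as n grows: the maximum-likelihood classifier in this toy setting reaches any given MSDE with fewer training examples.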

We remind the reader that even if the hypothesis class is a proper parametric model, the differentially
generated classifier will generalize as well as the probabilistically-generated (maximum-likelihood) classifier,
as long as the training sample size is very large. If (3.51) can be evaluated for a particular $G(\Theta, X)_{proper}$ and
$\Lambda_{P\text{-}ML}$, then we have a means of evaluating whether or not probabilistic learning is preferable (i.e., whether
or not hypothesis 3.1 is substantiated for $G(\Theta, X)_{proper}$ and $\Lambda_{P\text{-}ML}$) when the training sample size is small.

3.6.2  A Word Regarding “Proper Models”

The  term  “proper  model"  probably  originates  with  Dawes  [26].  From  the  statistical  pattern  recognition 
perspective,  he  uses  the  term  to  describe  a  parametric  model  of  the  feature  vector,  the  parameters  of  which 
are  learned  (i.e.,  estimated)  by  a  “proper”  — by  which  he  means  probabilistically  motivated  —  method. 
As  mentioned  above,  logistic  regression  is  an  example  of  a  proper  model  when  the  feature  vector  has 
homoscedastic  Gaussian  class-conditional  pdfs  with  the  additional  nice  properties  described  in  appendix  F. 
The  model  parameters  are  learned  by  the  “proper"  method  of  maximum-likelihood.  Dawes  characterizes 
an  improper  model  as  one  for  which  the  parameters  are  learned  by  some  “improper”  method  (i.e.,  one  that 
is  not  probabilistically  defensible).  He  argues  both  eloquently  and  persuasively  that  improper  models  often 
yield  better  pattern  classifiers  than  proper  ones. 

We agree. Indeed, we submit that the proofs of this and the preceding chapter explain the phenomenon:
so-called “proper” probabilistic learning strategies are inefficient, yielding Bayesian discrimination only if
the parametric model with which they are paired is a proper one for the feature vector. Dawes’ “improper”
parameter estimation strategies are superior when the parametric model is improper by our definition because
they are more efficient, generating classifiers that exhibit lower MSDE than their probabilistically-generated
counterparts. The irony is that probabilistic learning strategies aren’t always the best ones to employ. Indeed,
it motivates us to restrict the adjectives “proper” and “improper” to the parametric model (i.e., hypothesis
class) alone. In our view there are no proper or improper learning strategies, only efficient and inefficient
ones.


3.7  Summary 

In  this  chapter  we  have  defined  the  efficient  classifier,  the  relatively  efficient  classifier,  the  efficient 
learning  strategy,  and  the  asymptotically  efficient  learning  strategy.  We  have  defined  the  mean-squared 
discriminant  error  of  a  classifier,  and  motivated  its  use  as  a  measure  of  generalization  —  how  closely  and 
how  consistently  the  classifier  approximates  the  Bayes-optimal  classifier’s  minimum  error  rate.  We  have 
proven  that  the  differential  learning  strategy  is  asymptotically  efficient,  guaranteeing  the  best  approximation 
to  the  Bayes-optimal  classifier  allowed  by  one’s  a  priori  choice  of  hypothesis  class  when  the  training  sample 
size  is  asymptotically  large.  Moreover,  we  have  proven  that  differential  learning  requires  the  minimum 
classifier  complexity  necessary  for  Bayesian  discrimination.  We  have  proven  that  probabilistic  learning 
strategies  are  usually  inefficient,  failing  to  generate  the  best  approximation  to  the  Bayes-optimal  classifier 
for all but possibly one choice of hypothesis class, regardless of the training sample size. Furthermore,
probabilistic  learning  generally  requires  more  than  the  minimum  classifier  complexity  necessary  for  Bayesian 
discrimination. 

Viewing  all  differentiable  supervised  classifiers  as  parametric  models  of  the  feature  vector,  we  have 
distinguished  between  proper  and  improper  parametric  models.  We  have  shown  that  if  the  proper  parametric 
model of the feature vector exists, it is possible for probabilistic learning to generate a better approximation to
the Bayes-optimal classifier than differential learning can when the training sample size is small. Given the
explicit  desire  to  estimate  the  a  posteriori  probabilities  of  the  feature  vector  or  a  reasonable  likelihood  that 
the  hypothesis  class  constitutes  a  proper  parametric  model  —  as  determined  by  traditional  hypothesis  testing 
procedures  (e.g.,  see  [140])  — probabilistic  learning  is  the  preferred  strategy.  Absent  these,  differential 
learning  is  the  best  strategic  choice,  provably  requiring  the  simplest  model  of  the  data  and  the  smallest 
training  sample  size  necessary  for  good  generalization. 


Chapter  4 


The Robust Beauty of
Differentially-Generated Improper
Parametric Models¹

Outline 

We analyze two “toy” problems in order to make the theoretical arguments of chapters 2 and 3 more tangible.
We begin with a familiar 2-class Gaussian pattern recognition task; we illustrate that the probabilistically-generated
classifier can be more efficient than its differentially-generated counterpart for small training
sample  sizes  if  the  hypothesis  class  is  the  proper  parametric  model  of  the  feature  vector.  We  contrast  that  task 
with  a  simple  3-class  pattern  recognition  task  in  order  to  illustrate  that  differential  learning  is  asymptotically 
efficient  and  requires  the  minimum-complexity  classifier  necessary  for  Bayesian  discrimination.  The  analysis 
confirms  that  differential  learning  generates  the  relatively  efficient  classifier  for  small  as  well  as  large  training 
sample  sizes  when  the  hypothesis  class  is  not  a  proper  parametric  model  of  the  feature  vector.  Probabilistic 
learning  fails  to  generate  the  relatively  efficient  classifier  —  regardless  of  the  training  sample  size  —  when 
the  hypothesis  class  is  an  improper  parametric  model. 

4.1  Introduction 

The  purpose  of  this  chapter  is  to  illustrate  the  theoretical  points  of  chapters  2  and  3  so  that  the  reader  will 
gain  an  intuitive  appreciation  of  differential  learning  —  a  more  tangible  understanding  of  the  arguments  we 
have  made  so  far.  We  analyze  two  “toy”  problems  in  order  to  illustrate 

•  the asymptotically efficient, minimum-complexity nature of differential learning,

•  the generally inefficient, high-complexity nature of probabilistic learning, and

•  the special circumstances under which probabilistic learning can be efficient.

¹ The title of this chapter is inspired by R. M. Dawes’ paper The Robust Beauty of Improper Linear Models [26]. This chapter is a
revised and extended version of work first published in [52].

We  illustrate  these  characteristics  by  contrasting  a  pattern  recognition  task  for  which  the  chosen  parametric 
model  is  proper  with  a  task  for  which  the  chosen  parametric  model  is  improper. 

We begin with a familiar 2-class pattern recognition task for which the single feature is a homoscedastic,
Gaussian-distributed  random  variable;  we  learn  to  classify  this  random  feature  with  its  proper  parametric 
model  in  both  full  and  partial  forms,  showing  that  probabilistically-generated  variants  are  more  efficient  than 
the  differentially-generated  variant  when  the  training  sample  size  is  small.  All  learning  strategies  generate 
equally  efficient  classifiers  from  the  proper  parametric  hypothesis  class  as  the  training  sample  size  grows 
large.  We  contrast  this  result  with  a  simple  3-class  pattern  recognition  task  for  which  the  single  feature  is 
a heteroscedastic, uniformly-distributed random variable; we learn to classify this random feature with a
polynomial  classifier,  which  is  an  improper  parametric  model,  showing  that  differentially-generated  variants 
are  always  more  efficient  than  their  probabilistically-generated  counterparts  for  both  small  and  large  training 
sample  sizes.  The  3-class  task  also  illustrates  the  minimal  complexity  requirements  of  differential  learning. 

Both of these illustrative tasks lend themselves to closed-form analysis, so the classifiers’ learning/classification
characteristics can be derived for asymptotically large training sample sizes. We analyze
the classifiers’ characteristics for small training sample sizes via simulations. In effect, we play the role of an
oracle like the one described in section 3.2. Each classifier/learning strategy that we analyze learns repeatedly
over ten independent trials. In each trial all the different classifier/learning strategies learn the same training
sample of size n; the training sample for each trial is drawn independent of all other training samples. We
compute the true error rate for each classifier at the end of each trial, using the exact expressions for the
feature’s a posteriori class probabilities and class prior probabilities; we compile the results for all trials.
Each classifier’s posterior parameterization (and, as a result, its error rate) varies from trial to trial when
the training sample size n is finite. When n is infinite, each classifier's error rate is a constant (i.e., its
discriminant variance is zero), owing to the convergence properties of the random feature (see appendix B)
and the consistency of each classifier.² We obtain approximate values of each classifier’s discriminant bias,
discriminant variance, and mean-squared discriminant error (MSDE) by computing the sample mean and
sample variance of the classifier’s error rate across the ten trials for each training sample size. Thus, the only
difference between the experimental protocols of this chapter and the protocol for the oracle in example 3.2
(page 61) is that we run a finite, as opposed to an infinite, number of independent learning/test trials
for each classifier.
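The protocol just described can be sketched in a few lines for the 2-class Gaussian task of section 4.2 (class means -1.65 and 1.65, unit variance, equal priors). The sketch makes several simplifying assumptions of our own: each training sample holds n/2 examples per class; the classifier is a plug-in rule that places its boundary midway between the two sample class means (a stand-in for the logistic linear classifiers analyzed in this chapter, not a reproduction of them); and the MSDE is approximated as squared discriminant bias plus discriminant variance. All function names are ours.

```python
import math
import random
import statistics

MU1, MU2, SIGMA = -1.65, 1.65, 1.0  # class means and common std. dev. (section 4.2)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def true_error_rate(boundary):
    """Exact error rate of the rule 'x < boundary: class 1, else class 2' with equal priors."""
    miss_class1 = 1.0 - normal_cdf((boundary - MU1) / SIGMA)
    miss_class2 = normal_cdf((boundary - MU2) / SIGMA)
    return 0.5 * miss_class1 + 0.5 * miss_class2


BAYES_ERROR = true_error_rate(0.0)  # the Bayes-optimal boundary is x = 0


def run_trials(n, trials=10, seed=0):
    """Estimate (discriminant bias, discriminant variance, MSDE) over independent trials."""
    rng = random.Random(seed)
    error_rates = []
    for _ in range(trials):
        # Assumption: n // 2 examples of class 1 and the remainder of class 2 (n >= 2).
        class1 = [rng.gauss(MU1, SIGMA) for _ in range(n // 2)]
        class2 = [rng.gauss(MU2, SIGMA) for _ in range(n - n // 2)]
        boundary = 0.5 * (statistics.mean(class1) + statistics.mean(class2))
        error_rates.append(true_error_rate(boundary))
    dbias = statistics.mean(error_rates) - BAYES_ERROR
    dvar = statistics.variance(error_rates)
    # Assumption: MSDE approximated as squared bias plus variance.
    return dbias, dvar, dbias ** 2 + dvar


if __name__ == "__main__":
    for n in (4, 40, 400):
        print(n, run_trials(n))
```

Because the true error rate is computed in closed form rather than estimated on a test set, the only randomness in each trial comes from the training sample, which mirrors the role of the oracle.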

We emphasize the differences between differential and probabilistic learning strategies by contrasting
the results of these two experiments. In the process of evaluating the 2-class proper parametric model, we
demonstrate experimental results that are consistent with Efron’s theoretical comparison of the logistic and
normal-based linear discriminant analysis paradigms [30]. More importantly, our experiments reflect the
theoretical findings of chapters 2 and 3: differential learning always generates the most efficient classifier
allowed by the hypothesis class, given a sufficiently large training sample; probabilistic learning, in contrast,
fails to generate the most efficient classifier allowed, regardless of the training sample size, unless the
hypothesis class is a proper parametric model of the data.

² All of the learning strategies we employ lead to consistent classifiers (see definition 3.11). The differentially-generated classifier is
proven to be consistent in section 3.3.1; the probabilistic consistency proofs proceed along the lines of the differential proof; we leave
the details to the interested reader.

4.2  Analysis  of  a  Proper  Parametric  Model 

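Figure 4.1 illustrates a two-class scalar x with homoscedastic^ Gaussian class-conditional pdfs for classes
$\omega_1$ and $\omega_2$. There is one class boundary ($B_{1,2\,Bayes} = 0$) for the Bayes-optimal classifier of x. The
class-conditional pdfs of x are given by

$$p_{x|W}(x\,|\,\omega_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\,\exp\left[-\frac{1}{2\sigma^2}(x-\mu_i)^2\right]; \qquad \mu_1 = -1.65,\quad \mu_2 = 1.65,\quad \sigma^2 = 1 \tag{4.1}$$

The class prior probabilities are $P_W(\omega_1) = P_W(\omega_2) = \frac{1}{2}$. The a posteriori class probabilities of x are

$$\begin{aligned}
P_{W|x}(\omega_1\,|\,x) &= \left[1 + \exp\left[-\frac{1}{2\sigma^2}\left(2x(\mu_1-\mu_2) - \mu_1^2 + \mu_2^2\right)\right]\right]^{-1} \\
P_{W|x}(\omega_2\,|\,x) &= 1 - P_{W|x}(\omega_1\,|\,x) \\
&= \left[1 + \exp\left[-\frac{1}{2\sigma^2}\left(2x(\mu_2-\mu_1) + \mu_1^2 - \mu_2^2\right)\right]\right]^{-1}; \\
&\qquad \mu_1 = -1.65,\quad \mu_2 = 1.65,\quad \sigma^2 = 1
\end{aligned} \tag{4.2}$$

Given the values of $\mu_1$, $\mu_2$, and $\sigma^2$, the Bayes error rate is 4.9% when the following classification
strategy is employed:

$$\begin{aligned}
x < B_{1,2\,Bayes} &: \ \text{choose } \omega_1 \\
x > B_{1,2\,Bayes} &: \ \text{choose } \omega_2 \\
B_{1,2\,Bayes} &= 0
\end{aligned} \tag{4.3}$$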
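The posteriors of (4.2) and the 4.9% Bayes error rate can be checked numerically. The following is a minimal sketch, not part of the original experiments; the function names are illustrative, and the error rate is computed directly from the Gaussian tail areas on either side of the boundary at 0 under equal priors.

```python
import math

MU1, MU2, SIGMA2 = -1.65, 1.65, 1.0  # parameters quoted in (4.1)


def posterior_w1(x):
    """A posteriori probability of class omega_1, following (4.2)."""
    expo = -(2.0 * x * (MU1 - MU2) - MU1 ** 2 + MU2 ** 2) / (2.0 * SIGMA2)
    return 1.0 / (1.0 + math.exp(expo))


def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def bayes_error(boundary=0.0):
    """Error rate of the rule in (4.3) under equal class priors."""
    sigma = math.sqrt(SIGMA2)
    # omega_1 examples are misclassified when they fall to the right of the boundary
    miss1 = 1.0 - std_normal_cdf((boundary - MU1) / sigma)
    # omega_2 examples are misclassified when they fall to the left of the boundary
    miss2 = std_normal_cdf((boundary - MU2) / sigma)
    return 0.5 * miss1 + 0.5 * miss2


print("P(omega_1 | x = 0) =", posterior_w1(0.0))
print("Bayes error rate   =", bayes_error())
```

The two posteriors are equal at x = 0, which is why the Bayes-optimal boundary sits there.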


4.2.1  The  Proper  Parametric  Model 

We employ two forms of the same "logistic linear" classifier^ to learn the Bayes-optimal classifier of x. The
first form is the fully-parametric proper model; the second is the partially-parametric proper model.

^Recall that homoscedastic pdfs all have the same variance parameter (or covariance matrix).

^See  section  7.2.2. 
Figure 4.1: A two-class scalar feature discrimination task. The single feature is a homoscedastic, Gaussian-
distributed random variable. Top: class-conditional density - class prior products $p_{x|W}(x\,|\,\omega_i) \cdot P_W(\omega_i)$;
those for class $\omega_1$ are shown in dark gray; those for class $\omega_2$ are shown in white; the region of overlap is
shown in light gray. Bottom: the a posteriori class probabilities $P_{W|x}(\omega_1\,|\,x)$ and $P_{W|x}(\omega_2\,|\,x)$.


Figure 4.2: The proper parametric model of x. The logistic linear hypothesis class follows from both the
partially-parametric and fully-parametric proper models of x. Discriminant functions are shown for three
parameterizations that yield Bayes-optimal discrimination of x. The dashed line denotes the parameterization
by which the discriminant functions are identically the a posteriori class probabilities of x. The solid lines
denote two different parameterizations by which the discriminant functions partition feature space in the
Bayes-optimal fashion; note that neither of these parameterizations yields discriminant functions that are
identically the a posteriori class probabilities of x.
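The fully-parametric proper model describes x in terms of its class-conditional pdfs; its parameters
model the unknown class-conditional means and the unknown variance parameter of x:

$$\begin{aligned}
g_1(x\,|\,\theta) &= \frac{p_{x|W}(x\,|\,\omega_1,\hat\mu_1,\hat\sigma^2)}{p_{x|W}(x\,|\,\omega_1,\hat\mu_1,\hat\sigma^2) + p_{x|W}(x\,|\,\omega_2,\hat\mu_2,\hat\sigma^2)} \\
&= \left[1 + \exp\left[\frac{1}{\hat\sigma^2}(\hat\mu_2-\hat\mu_1)\,x + \frac{1}{2\hat\sigma^2}(\hat\mu_1^2-\hat\mu_2^2)\right]\right]^{-1} \\
g_2(x\,|\,\theta) &= 1 - g_1(x\,|\,\theta) \\
&= \frac{p_{x|W}(x\,|\,\omega_2,\hat\mu_2,\hat\sigma^2)}{p_{x|W}(x\,|\,\omega_1,\hat\mu_1,\hat\sigma^2) + p_{x|W}(x\,|\,\omega_2,\hat\mu_2,\hat\sigma^2)} \\
&= \left[1 + \exp\left[-\frac{1}{\hat\sigma^2}(\hat\mu_2-\hat\mu_1)\,x - \frac{1}{2\hat\sigma^2}(\hat\mu_1^2-\hat\mu_2^2)\right]\right]^{-1}
\end{aligned} \tag{4.4}$$

where

$$p_{x|W}(x\,|\,\omega_i,\hat\mu_i,\hat\sigma^2) = \frac{1}{\sqrt{2\pi\hat\sigma^2}}\,\exp\left[-\frac{(x-\hat\mu_i)^2}{2\hat\sigma^2}\right] \tag{4.5}$$

The parameter vector $\theta$ in (4.4) denotes the three parameters $\{\hat\mu_1, \hat\mu_2, \hat\sigma^2\}$.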

The classifier described by (4.4) is depicted in figure 4.2. It is, by definitions 3.13 and 3.14, a proper
parametric model of x because the discriminant functions of (4.4) are identically equal to the a posteriori
class probabilities of (4.2) when $\hat\mu_1 = \mu_1$, $\hat\mu_2 = \mu_2$, and $\hat\sigma^2 = \sigma^2$. More specifically, (4.4) describes the
fully-parametric proper model of x, generally called the normal-based linear discriminant analysis paradigm
in the statistical pattern recognition literature (e.g., [91]). We denote this model with the initials "ML" since
the model learns by the method of maximum-likelihood, described in detail for the general homoscedastic
Gaussian feature vector in section F.1. The resulting maximum-likelihood parameters are


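$$\begin{aligned}
\hat\mu_i &= \frac{1}{n_i}\sum_{j=1}^{n} I\left(\mathcal{W}(x^j),\,\omega_i\right)\cdot x^j \\
\hat\sigma^2 &= \frac{1}{n}\sum_{i=1}^{2}\sum_{j=1}^{n} I\left(\mathcal{W}(x^j),\,\omega_i\right)\cdot\left(x^j-\hat\mu_i\right)^2
\end{aligned} \tag{4.6}$$

where $x^j$ denotes the jth example of x (as opposed to the jth power of x, which we denote by $(x)^j$). Note
that

$$I\left(\mathcal{W}(x^j),\,\omega_i\right) = \begin{cases} 1, & \mathcal{W}(x^j) = \omega_i \\ 0, & \text{otherwise} \end{cases} \tag{4.7}$$

and $n_i$ denotes the number of examples having the class label $\omega_i$ in the training sample of size n.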

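The class boundary $B_{1,2\,ML}$ formed by the fully-parametric (ML) model is the value for which
$g_1(B_{1,2\,ML}\,|\,\theta) = g_2(B_{1,2\,ML}\,|\,\theta) = \frac{1}{2}$; this occurs at

$$\underbrace{B_{1,2\,ML}}_{\text{Fully-Parametric}} = \frac{\hat\mu_1 + \hat\mu_2}{2} \tag{4.8}$$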

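The partially-parametric proper model describes x in terms of its a posteriori class probabilities; its
parameters model the unknown parameters of these probabilities. Given the following definitions

$$\alpha \triangleq \frac{\mu_2-\mu_1}{\sigma^2}, \qquad \beta \triangleq \frac{\mu_1^2-\mu_2^2}{2\sigma^2}, \tag{4.9}$$

the a posteriori class probabilities in (4.2) can be written as follows:

$$\begin{aligned}
P_{W|x}(\omega_1\,|\,x) &= \left[1 + \exp[\alpha x + \beta]\right]^{-1} \\
P_{W|x}(\omega_2\,|\,x) &= \left[1 + \exp[-\alpha x - \beta]\right]^{-1}
\end{aligned} \tag{4.10}$$

The partially-parametric proper model of x is given by

$$\begin{aligned}
g_1(x\,|\,\theta) &= \left[1 + \exp[\theta_{1,1}\,x + \theta_{1,0}]\right]^{-1} \\
g_2(x\,|\,\theta) &= \left[1 + \exp[-\theta_{1,1}\,x - \theta_{1,0}]\right]^{-1}
\end{aligned} \tag{4.11}$$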


where $\theta = \{\theta_{1,0}, \theta_{1,1}\}$. It is, by definitions 3.13 and 3.14, a proper parametric model of x because the
discriminant functions of (4.11) are identically equal to the a posteriori class probabilities of (4.2) and (4.10)
when $\theta_{1,1} = \alpha = \frac{\mu_2-\mu_1}{\sigma^2}$ and $\theta_{1,0} = \beta = \frac{\mu_1^2-\mu_2^2}{2\sigma^2}$. More specifically, (4.11) describes the partially-
parametric proper model of x, generally called the logistic discriminant analysis (or logistic regression)
paradigm in the statistical pattern recognition literature (e.g., [91]). The model learns by the method of
maximum-likelihood, described in detail for the general homoscedastic Gaussian feature vector in section F.2.
The resulting maximum-likelihood parameters cannot be expressed in closed form. However, section F.2
proves that these parameters are obtained by minimizing the Kullback-Leibler information distance (CE
[82, 81], see section 2.3.2) between the discriminant functions and their corresponding empirical a posteriori
class probabilities over the domain of x. The proof has been worked previously by Akaike and White
[2, 140, 141] and by Hjort [65]. We denote the partially-parametric model generated probabilistically via the
maximum-likelihood/Kullback-Leibler information distance learning procedure with the initials "CE".

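The class boundary $B_{1,2\,CE}$ formed by the partially-parametric model is the value for which
$g_1(B_{1,2\,CE}\,|\,\theta) = g_2(B_{1,2\,CE}\,|\,\theta) = \frac{1}{2}$; this occurs at

$$\underbrace{B_{1,2\,CE}}_{\text{Partially-Parametric}} = \frac{-\theta_{1,0}}{\theta_{1,1}} \tag{4.12}$$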
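Because the CE parameters have no closed form, they must be found by iterative search. The sketch below is one simple way to do that and is not the procedure used in the experiments: it runs plain batch gradient descent on the average cross-entropy of the model (4.11), starting from zero parameters, on a synthetic sample drawn from the densities of (4.1). The step count, learning rate, seed, and sample size are arbitrary illustrative choices.

```python
import math
import random


def fit_ce(samples, steps=500, lr=0.5):
    """Fit theta_{1,0}, theta_{1,1} of (4.11) by minimizing average cross-entropy.

    samples: list of (x, label) pairs, label in {1, 2}.
    """
    t0, t1 = 0.0, 0.0
    n = len(samples)
    for _ in range(steps):
        grad0 = grad1 = 0.0
        for x, lab in samples:
            z = t1 * x + t0
            g2 = 1.0 / (1.0 + math.exp(-z))  # g_2 = 1 - g_1 in (4.11)
            y = 1.0 if lab == 2 else 0.0     # target for g_2
            grad0 += g2 - y                  # d(CE)/d(theta_{1,0})
            grad1 += (g2 - y) * x            # d(CE)/d(theta_{1,1})
        t0 -= lr * grad0 / n
        t1 -= lr * grad1 / n
    return t0, t1


def boundary(t0, t1):
    """Class boundary of (4.12)."""
    return -t0 / t1


def draw_sample(n_per_class, seed=0):
    """Synthetic sample from the two Gaussians of (4.1)."""
    rng = random.Random(seed)
    s1 = [(rng.gauss(-1.65, 1.0), 1) for _ in range(n_per_class)]
    s2 = [(rng.gauss(1.65, 1.0), 2) for _ in range(n_per_class)]
    return s1 + s2


alpha = (1.65 - (-1.65)) / 1.0           # (4.9) with the values of (4.1)
beta = ((-1.65) ** 2 - 1.65 ** 2) / 2.0  # (4.9); zero because the means are symmetric
theta0, theta1 = fit_ce(draw_sample(200))
print("alpha, beta:", alpha, beta)
print("fitted theta_{1,0}, theta_{1,1}:", theta0, theta1)
print("fitted class boundary:", boundary(theta0, theta1))
```

The fitted boundary depends only on the ratio of the two parameters, so it can settle near the Bayes-optimal value before the individual parameters have converged to $\alpha$ and $\beta$.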


4.2.2  Probabilistic  Learning  for  the  Asymptotically  Large  Training  Sample 

For the asymptotically large training sample size (i.e., $n \to \infty$), the maximum-likelihood parameters of the
fully-parametric (ML) proper model are

$$\lim_{n\to\infty}\hat\mu_1 = \mu_1, \qquad \lim_{n\to\infty}\hat\mu_2 = \mu_2, \qquad \lim_{n\to\infty}\hat\sigma^2 = \sigma^2 \tag{4.13}$$

(see section F.1). By (4.2) and (4.8), $\lim_{n\to\infty} B_{1,2\,ML} = B_{1,2\,Bayes} = 0$, and the ML classifier exhibits
Bayesian discrimination.

For asymptotically large training sample sizes, the maximum-likelihood parameters of the partially-
parametric model are given by

$$\lim_{n\to\infty}\theta_{1,1} = \alpha = \frac{\mu_2-\mu_1}{\sigma^2}, \qquad \lim_{n\to\infty}\theta_{1,0} = \beta = \frac{\mu_1^2-\mu_2^2}{2\sigma^2} \tag{4.14}$$

(see section F.2). By (4.2) and (4.12), $\lim_{n\to\infty} B_{1,2\,CE} = B_{1,2\,Bayes} = 0$, and the CE-generated
partially-parametric proper model exhibits Bayesian discrimination.

Since the partially-parametric proper model constitutes a differentiable supervised classifier (definition
2.8, page 25), it can be generated with any error measure, not just CE. We denote the general error
measure by "EM". We denote the training sample of size n by $S^n$, and we denote a particular unique value
(or pattern) of x by $x_p$. If there are P unique patterns in $S^n$, and for each of these patterns there are $n_{p,i}$
examples belonging to class $\omega_i$, the sample EM is given by^


^Please see section 2.3.1 for specific constraints on the forms of $f(D - g_i(x\,|\,\theta))$ and $f(g_i(x\,|\,\theta) - \neg D)$.
$$EM(S^n\,|\,\theta) = \sum_{i=1}^{C=2}\sum_{p=1}^{P}\left[\,n_{p,i}\cdot f\left(D - g_i(x_p\,|\,\theta)\right) + (n_p - n_{p,i})\cdot f\left(g_i(x_p\,|\,\theta) - \neg D\right)\right]; \qquad \sum_{p=1}^{P} n_p = n \tag{4.15}$$

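From section 2.3.1, we know that the classifier's EM can be expressed by the following expectation as the
training sample size grows asymptotically large:

$$E_x[EM(x\,|\,\theta)] = \int \sum_{i=1}^{C=2}\left[\,f\left(D - g_i(x\,|\,\theta)\right)\cdot P_{W|x}(\omega_i\,|\,x) + f\left(g_i(x\,|\,\theta) - \neg D\right)\cdot\left(1 - P_{W|x}(\omega_i\,|\,x)\right)\right] p_x(x)\,dx \tag{4.16}$$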


The parameterization $\theta^*$ that minimizes the classifier's EM for the asymptotically large training sample
size can be found by substituting the discriminant function expressions of (4.11) into (4.16), deriving the
expression for the gradient $\nabla_\theta\left(E_x[EM(x\,|\,\theta^*)]\right)$, setting this gradient equal to the zero vector, and solving
the resulting normal equations (see section 2.3.1). Since the partially-parametric model is proper, and since the
general error measure is, by definition (see section 2.3.1), minimal when $g_i(x\,|\,\theta) = P_{W|x}(\omega_i\,|\,x)\ \forall i$,
the general error measure generates the parameters of (4.14) for asymptotically large training sample sizes.
By (4.12), the general error measure therefore generates the Bayes-optimal classifier from the partially-
parametric model, given an asymptotically large training sample; that is, $\lim_{n\to\infty} B_{1,2\,EM} = B_{1,2\,Bayes} = 0$,
and the EM-generated partially-parametric proper model exhibits Bayesian discrimination.


4.2.3  Differential  Learning  via  CFM  for  the  Asymptotically  Large  Training  Sample 

The partially-parametric proper model can also learn differentially. Differential learning is implemented by
maximizing the classification figure-of-merit (CFM) objective function described in section 2.2.4, chapter 5,
and appendix D. The procedure is virtually the same as probabilistic learning, except for the change in
objective function. From section 2.4, the sample CFM is given by

$$CFM(S^n\,|\,\theta) = \sum_{i=1}^{C=2}\sum_{p=1}^{P} n_{p,i}\cdot\sigma\left[\delta_i(x_p\,|\,\theta),\,\psi\right] \tag{4.17}$$
In this C = 2-class case, the discriminant differentials^ are given by

$$\begin{aligned}
\delta_1(x_p\,|\,\theta) &= 2\,g_1(x_p\,|\,\theta) - 1 \\
\delta_2(x_p\,|\,\theta) &= -\delta_1(x_p\,|\,\theta) \\
&= 1 - 2\,g_1(x_p\,|\,\theta),
\end{aligned} \tag{4.18}$$

and $g_1(x\,|\,\theta)$ is given by (4.11). Details of the CFM confidence parameter $\psi$ are given in appendix D.
The classifier's CFM can be expressed by the following expectation as the training sample size grows
asymptotically large (section 2.4):

$$E_x[CFM(x\,|\,\theta)] = \sum_{i=1}^{C=2}\int \sigma\left[\delta_i(x\,|\,\theta),\,\psi\right]\cdot P_{W|x}(\omega_i\,|\,x)\; p_x(x)\,dx \tag{4.19}$$


Section 2.4 proves that when the classifier is differentially-generated, the CFM objective function
($\lim_{\psi\to 0^+}$) is maximized when the top-ranked discriminant differential (i.e., $\delta_{(1)}(x\,|\,\theta)$) corresponds to the
most likely class of x over the feature's domain:^ mathematically,

$\lim_{n\to\infty,\ \psi\to 0^+} E_x[CFM(x\,|\,\theta^*)]$ in (4.19) is maximized if $\theta^*$ is such that

$$\delta_{(1)}(x\,|\,\theta^*) = \delta_k(x\,|\,\theta^*) \quad \text{for all } x \text{ such that} \quad P_{W|x}(\omega_k\,|\,x) > \max_{j\neq k} P_{W|x}(\omega_j\,|\,x) \tag{4.20}$$

Of course (4.20) holds only if the classifier has sufficient functional complexity to yield Bayesian
discrimination. The logistic linear classifier does, so it maximizes CFM, satisfies (4.20), and constitutes the
Bayes-optimal classifier when its parameters satisfy the following constraints:


=  0 
CFM  <  0 


(4.21) 


When these conditions are satisfied, the resulting class boundary equals the Bayes-optimal boundary by
(4.12):

$$\lim_{n\to\infty} B_{1,2\,CFM} = B_{1,2\,Bayes} = 0 \tag{4.22}$$
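The relationship between the discriminant differentials of (4.18) and the class boundary of (4.12) can be illustrated with a short sketch. It assumes the logistic form of (4.11) and classifies by the top-ranked differential; the function names and the second parameter pair are hypothetical, chosen only to show that the decision flips at $-\theta_{1,0}/\theta_{1,1}$.

```python
import math


def g1(x, t0, t1):
    """Discriminant function g_1 of (4.11)."""
    return 1.0 / (1.0 + math.exp(t1 * x + t0))


def differentials(x, t0, t1):
    """Discriminant differentials (delta_1, delta_2) of (4.18)."""
    d1 = 2.0 * g1(x, t0, t1) - 1.0
    return d1, -d1


def classify(x, t0, t1):
    """Return the class index whose discriminant differential is top-ranked."""
    d1, d2 = differentials(x, t0, t1)
    return 1 if d1 > d2 else 2


# Parameters of (4.14) for the values in (4.1): boundary at 0.
print(classify(-1.0, 0.0, 3.3), classify(1.0, 0.0, 3.3))

# Hypothetical parameters with boundary -t0/t1 = 0.5.
print(classify(0.4, -1.0, 2.0), classify(0.6, -1.0, 2.0))
```

Only the sign of $\delta_1$ matters for the decision, which is why parameterizations with very different discriminant function shapes can still partition feature space identically.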


^Recall definition 2.7 and section 2.4.

^We remind the reader that the logistic linear classifier for the C = 2-class pattern recognition task has only one discriminant
function $g_1(x\,|\,\theta)$; we create a phantom second discriminant function $g_2(x\,|\,\theta) = 1 - g_1(x\,|\,\theta)$ for the purpose of computing and using
the discriminant differentials $\delta_1(x\,|\,\theta)$ and $\delta_2(x\,|\,\theta)$. This is an artifice by which differential learning is extended to the single-output,
differentiable supervised classifier.
4.2.4 Results of Differential and Probabilistic Learning for Asymptotically Large
and Small Training Samples

A word regarding error rates: Recall from definition 3.1 (page 55) that the true error rate for
the classifier of x is given by

$$P_e(\mathcal{G}\,|\,\theta) = E_x\left[p_e(\mathcal{G}(x\,|\,\theta))\right] = \int p_e(\mathcal{G}(x\,|\,\theta))\; p_x(x)\,dx \tag{4.23}$$

where

$$p_e(\mathcal{G}(x\,|\,\theta)) = 1 - P_{W|x}\left(\Gamma(\mathcal{G}(x\,|\,\theta))\,|\,x\right) \tag{4.24}$$

The error rates that we quote in this chapter (for both the proper and improper parametric
models) are computed according to (4.23), since we play the role of an oracle, we know
the probabilistic nature of the feature x, and the associated integrals are tractable. Specific
details of the error rate computations for this section and section 4.3.4 are given in appendix G.


Figure 4.3 displays the empirical distribution of the error rates for four logistic linear classifiers of x,
based on ten independent learning/testing trials. Statistics are shown for the fully-parametric model and the
partially-parametric model; the latter employs two forms of probabilistic learning (via the MSE and CE error
measure objective functions) as well as differential learning (via the CFM objective function). Results for
differential learning via CFM are shown in white; results for probabilistic learning via ML (fully-parametric
model) and CE and MSE (partially-parametric model) are shown in gray. The results are shown in box-plot
[131, ch. 2] statistical summaries. In brief, the box of each plot has vertical extrema that match the first and
third quartiles of the sample data; the horizontal line dividing the box delineates the median of the sample
data; the inner and (if shown) outer "T"-shaped "fences" of each plot depict the nominal lower bound
of the first quartile and nominal upper bound of the fourth quartile. Extreme values in the first and fourth
quartiles falling beyond the outer fence(s) are plotted as dots (see appendix C for details of the box plot
statistical summary). All results for finite training sample sizes are based on 10 independent trials for the
specified training sample size (all classifiers learn the same training sample in a given trial, for a given sample
size). Learning for the fully-parametric proper model is a simple maximum-likelihood parameter estimation
procedure, the computations of which are specified by (4.6). For the partially-parametric model, learning
takes the form of a steepest descent (MSE, CE, etc.) or steepest ascent (CFM) search over parameter space,
using a modified form of the backpropagation algorithm (e.g., [119, 120]); learning begins from a tabula
rasa state in which all parameters are initialized randomly according to a uniform distribution on the closed
interval [-.3, .3]. All trials are completely automated, so learning is done without any human intervention.
All experimental conditions (except, of course, the objective function used) are identical for differential
and probabilistic learning. Classifier parameterizations for the asymptotically large training sample size are
derived as described in sections 4.2.2 and 4.2.3.

The box plots of figure 4.3 are empirical analogs to the whisker plots of figure 3.1. That is, they give
us approximations of the discriminant bias, discriminant variance, and mean-squared discriminant error
(MSDE) for each classifier/learning strategy. They show that the fully-parametric proper model is the most
efficient estimator of the Bayes-optimal classifier for small (10) and medium (100) training sample sizes.
The CFM-generated partially-parametric model is the least efficient for small training sample sizes (note the
one trial for which the CFM-generated classifier's error rate is 27%, depicted as a dot in the figure). As
the training sample size goes to 100 examples, all the partially-parametric models are roughly comparable,
although significantly less efficient than the fully-parametric model. When the training sample size increases
to 1000, all the classifiers appear comparable. These box plots are noteworthy for two reasons: first, they
demonstrate that the differentially-generated classifier is indeed asymptotically efficient (as the training
sample size increases beyond 1000 examples, the differentially-generated model is as good as any of the
others); second, they demonstrate that the probabilistically-generated proper parametric models are the most
efficient classifiers for small training sample sizes.

This second finding is consistent with our theoretical description of the circumstances under which
probabilistic learning is more efficient than differential learning for small training sample sizes (section 3.6).
An analysis of a single 10-example learning trial lends a qualitative dimension to the theoretical description
of the phenomenon. Figure 4.4 shows the empirical class-conditional pdf - empirical class prior probability
products of a 10-example random sample of x; they are shown in histogram form, superimposed on the
true class-conditional pdf - class prior probability products of x. There are five examples of each class;
all the examples of $\omega_1$ fall to the left of x = -.75; all the examples of $\omega_2$ fall to the right of x = 1.4.
As a result, there is an interval [-.75, 1.4] inside which there are no training examples. Nevertheless, the
partially-parametric proper model generated probabilistically from these ten training examples (using the CE
objective function) forms a class boundary $B_{1,2\,CE}$ very close to the Bayes-optimal boundary $B_{1,2\,Bayes} = 0$.
The model's logistic linear discriminant functions and the partitioning of feature space they produce are
shown in figure 4.5; they exhibit a 5.1% error rate, a good approximation to the Bayes error rate of 4.9%,
despite the small training sample size and the lack of any examples on the interval [-.75, 1.4]. Because it
reduces to estimating the parameters of a model that is known to be proper in this particular case, probabilistic
learning via CE is efficient even for small training sample sizes. The lack of training examples in the
vicinity of $B_{1,2\,Bayes}$ is of little consequence because all training examples contain information about the
class-conditional means $\mu_1$ and $\mu_2$ and the variance parameter $\sigma^2$ of the partially-parametric proper model;
Figure 4.3: A comparison of error rates for differentially (CFM) and probabilistically (MSE, CE, and ML)
generated logistic linear classifiers. Results for the differentially generated classifier are shown in white;
those for the probabilistically generated classifiers are shown in gray (MSE = dark gray, CE = medium gray,
ML = light gray). Note that the ML-generated logistic linear classifier is the fully-parametric proper model
of x; the CE-generated logistic linear classifier is the partially-parametric proper model of x.


that is, good estimates of these parameters are possible even when the training sample contains no examples
near $B_{1,2\,Bayes}$. Differential learning, on the other hand, does not exploit the proper nature of the logistic
linear discriminant functions; it views the discriminant functions in a completely "agnostic" way, employing
them in any manner that classifies the training sample correctly. Since any partially-parametric proper model
with the parameters


-.75  <  6\fiCFU  < 

<^l,l  CFM  <  0 


(4.25) 


will classify the training sample without error, there is a wide choice of maximum-CFM parameterizations
for the classifier. The discriminant functions of the CFM-generated partially-parametric proper model, given
these 10 training examples, are shown in figure 4.6. Note that their partitioning of feature space deviates
significantly from the Bayes-optimal partitioning; they exhibit a 10.1% error rate, a poor approximation
to the Bayes error rate of 4.9%.
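The sensitivity of the true error rate to the placement of the class boundary inside the empty interval can be illustrated directly. The sketch below is an assumption-laden simplification, not the appendix G computation: it assumes a single boundary with $\omega_1$ chosen to its left, equal class priors, and the Gaussian densities of (4.1).

```python
import math

MU1, MU2, SIGMA = -1.65, 1.65, 1.0  # values from (4.1)


def std_normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def error_rate(boundary):
    """True error rate when omega_1 is chosen for x < boundary, omega_2 otherwise."""
    miss1 = 1.0 - std_normal_cdf((boundary - MU1) / SIGMA)  # omega_1 right of boundary
    miss2 = std_normal_cdf((boundary - MU2) / SIGMA)        # omega_2 left of boundary
    return 0.5 * (miss1 + miss2)


# Every boundary in [-.75, 1.4] separates the ten training examples perfectly,
# yet the true error rates differ substantially across that interval.
for b in (-0.75, 0.0, 0.5, 1.0, 1.4):
    print(f"boundary {b:+.2f}: error rate {100.0 * error_rate(b):.1f}%")
```

Under these assumptions the error rate ranges from about 5% at the Bayes-optimal boundary to about 20% at the right edge of the gap, which is why a training sample that is classified without error says little about how close the classifier is to Bayes-optimal.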

These results illustrate a simple fact: if there is a proper model for the data, probabilistic learning
will generate the efficient classifier from it by exploiting the proper nature of the model. Indeed, it can
be argued (via the Cramer-Rao notion of efficient parameter estimation [107], [22, ch's. 32-33]) that


Figure 4.4: The empirical class-conditional pdfs of x multiplied by their empirical class prior probabilities
for a training sample size of 10 examples; they are shown in histogram form (dark gray), superimposed on the
true class-conditional pdf - class prior probability products (lighter gray). There are five examples for each
of the two classes. The bar-graph below the class-conditional pdfs depicts the Bayes-optimal partitioning of
feature space; the class boundary occurs where the bar-graph shifts from class $\omega_1$ to class $\omega_2$.


the  probabilistically-generated  proper  parametric  model  exploits  all  of  the  Fisher  information  (e.g.,  [2])* 
contained  in  the  training  sample,  whereas  differential  learning  by  its  very  nature  does  not.  We  stress  that 
the  Fisher  information  content  of  the  training  sample  pertains  to  the  unknown  parameters  of  the  model;  it 
does  not  pertain  specifically  to  the  Bayes-optimal  class  boundaries  on  feature  space.  Thus,  the  information  is 
useful  (i.e.,  it  is  valid  information)  for  pattern  recognition  purposes  only  if  the  model  is  indeed  proper. 

Efron has proven that the fully-parametric proper model is the most efficient classifier of the 2-class
feature vector with homoscedastic Gaussian pdfs; the partially-parametric model is somewhat less efficient

*See [28, sec. 7.8] for a concise, readable discussion of Fisher information and its relationship to the Cramer-Rao bound.
Figure 4.5: The empirical a posteriori class probabilities of x for the 10 examples of figure 4.4; they are
shown in histogram form (dark gray), superimposed on the true probabilities (lighter gray), which are shown
only for the finite interval -10 < x < 10. Histograms take on a default value of zero for regions on the
domain of x where no training samples occur. The discriminant functions of the CE-generated logistic linear
classifier (i.e., the partially-parametric model of x) are superimposed in black. Note that the CE-generated
classifier's partitioning of feature space is a close approximation to the Bayes-optimal partitioning.


Figure 4.6: The same empirical a posteriori class probabilities shown in figure 4.5. The discriminant functions
of the CFM-generated logistic linear classifier are superimposed in black. Note the large gap between the
examples of $\omega_1$ and $\omega_2$ on the domain of x: since differential learning via CFM is discriminative, any set of
discriminant functions that forms a class boundary in this gap is "optimal". As a result, the CFM-generated
classifier's partitioning of feature space is a poor approximation to the Bayes-optimal partitioning, given this
small training sample size (n = 10).


[30]. As we describe in section F.3, Efron's notion of asymptotic relative efficiency differs from ours: he
defines ARE as the ratio of one model's discriminant bias to another model's; we define ARE as the ratio
of one model's MSDE (squared discriminant bias plus discriminant variance) to another model's. This
difference notwithstanding, the philosophical motivation for both definitions is similar. Since our feature x
has class-conditional means of $\mu_1 = -1.65$ and $\mu_2 = 1.65$ and a variance parameter of $\sigma^2 = 1$, the
Mahalanobis distance (e.g., [29, pg. 24]) between the class-conditional means is 3.3. Given this and the equal
class prior probabilities of $\frac{1}{2}$, Efron predicts that the asymptotic discriminant bias of the fully-parametric
proper model will be approximately .55 that of the partially-parametric model (see [30, (1.12)]).

Figure 4.7 displays the approximated MSDE (~MSDE) for the four classifiers in figure 4.3; figure 4.8
displays their approximated discriminant bias (~DBias) for the experiments. The box plots in figure 4.3
display the results in quartiles, whereas ~MSDE and ~DBias are based on sample averages; this accounts
for slight differences among the figures. Again, the fully-parametric proper model (ML) and the partially-
parametric proper models (CE, MSE) are all probabilistically-generated; one partially-parametric proper
model (CFM) is differentially-generated. The ML model's ~MSDE is consistently lowest for all finite
training sample sizes; the other probabilistically-generated models' ~MSDE is appreciably lower than the
CFM model's for a training sample size of 10. For sample sizes greater than 100 all the partially-parametric
models are roughly equivalent. For asymptotically large training sample sizes, all four classifiers exhibit zero
MSDE. The gray-shaded region in figure 4.7 denotes values of ~MSDE less than $10^{-6}$. We consider all
classifiers that exhibit ~MSDE below this threshold to be equally good approximations of the Bayes-optimal
classifier. To put this in perspective, a classifier with 0.1% discriminant bias and no discriminant variance
exhibits a MSDE of $10^{-6}$, as does a classifier with no discriminant bias and a discriminant variance of
$10^{-6}$. Thus, this MSDE threshold constitutes a rigorous standard of good approximation. The gray-shaded
region in figure 4.8 denotes values of ~DBias less than $10^{-3}$, a threshold that is identical to the ~MSDE
threshold if the classifier has no discriminant variance. We consider all classifiers that exhibit ~DBias
below this threshold to be equally unbiased approximations of the Bayes-optimal classifier. Both figures
show that probabilistic learning generates more efficient, less biased classifiers for small training sample
sizes than differential learning does. As the training sample size grows large (i.e., as it exceeds $10^3$),
the differentially-generated classifier becomes as good an approximation to the Bayes-optimal classifier as
any of the probabilistic models, a phenomenon consistent with the asymptotic efficiency of differential
learning. It is not surprising that the ML model is consistently more efficient than all the others. Indeed,

based  on  our  lO-trial  experiments,  the  ~DBias  of  the  ML  model  is  3  x  10'^ ,  whereas  it  is  7  x  10“^  for  the 

CE model. Since the ML model is the fully-parametric maximum-likelihood paradigm and the CE model is
the partially-parametric maximum-likelihood paradigm, Efron's prediction applies. We denote the logistic
linear hypothesis class (ultimately employed by both the ML and CE models) by $G(\Theta)$; we denote the
ML model's maximum-likelihood probabilistic learning scheme by $\Lambda_{P\text{-}ML}$, and we denote the CE model's
Figure 4.7: A comparison of the approximated mean-squared discriminant error (~MSDE) for the differentially
(CFM) and probabilistically (MSE, CE, and ML) generated classifiers. Results for the differentially
generated classifier are shown by the solid line; those for the probabilistically generated classifiers are shown
by dashed lines. The gray background depicts the value of ~MSDE below which we consider all classifiers
equally good approximations to the Bayes-optimal classifier. The CFM-generated classifier is not as efficient
as its probabilistically-generated counterparts when the training sample size is small (O[10]); however, owing
to  the  asymptotic  efficiency  of  differential  learning,  the  difference  between  the  CFM-generated  classifier  and 
its  probabilistically-generated  counterparts  is  negligible  for  sample  sizes  greater  than  O  [  10-^]  (cf.  figure  4.3). 



Figure  4.8:  A  comparison  of  the  approximated  discriminant  bias  (~DBias)  for  the  differentially  (CFM) 
and  probabilistically  (MSE,  CE,  and  ML)  generated  classifiers.  Results  for  the  differentially  generated 
classifier  are  shown  by  the  solid  line;  those  for  the  probabilistically  generated  classifiers  are  shown  by 
dashed lines. The gray background depicts the value of ~DBias below which we consider all classifiers
equally good approximations to the Bayes-optimal classifier. The CFM-generated classifier exhibits higher
discriminant bias than its probabilistically-generated counterparts when the training sample size is small
(O[10]); however, owing to the asymptotic efficiency of differential learning, the difference between the
CFM-generated classifier and its probabilistically-generated counterparts is negligible for sample sizes greater
than  O  [lO^]  (cf.  figure  4.3). 


4. 3  Improper  Parametric  Model 
maximum-likelihood probabilistic learning scheme by $\Lambda_{P\text{-}CE}$. Based on the probabilistic nature of x, Efron's
prediction is

$$\frac{\text{DBias}\left[\mathcal{G}\,|\,n,\ G(\Theta),\ \Lambda_{P\text{-}ML}\right]}{\text{DBias}\left[\mathcal{G}\,|\,n,\ G(\Theta),\ \Lambda_{P\text{-}CE}\right]} \approx .55 \tag{4.26}$$

Our experiments, depicted in figure 4.8, yield


~  DBias  [Q  I  w,  G(0),  Ap.ml1 
~  DBias  [Q  I  n,  G(0),  Ap.cE 


(4.27) 


which  is  a  good  approximation  to  Efron’s  prediction,  considering  the  small  number  of  trials  and  the  small 
training  sample  size  used. 
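The relationship between the threshold values quoted for ~MSDE and ~DBias can be made concrete with a small sketch. It assumes, for illustration only, that the approximated discriminant bias is the mean excess of a classifier's error rate over the Bayes error rate across trials, and that the approximated discriminant variance is the variance of those error rates; the formal definitions are those of chapter 3. The ten error rates below are made-up values, not experimental results.

```python
def approx_dbias_msde(error_rates, bayes_error):
    """Return (~DBias, ~MSDE) from per-trial error rates (fractions, not percent)."""
    k = len(error_rates)
    mean_err = sum(error_rates) / k
    dbias = mean_err - bayes_error                           # mean excess error
    dvar = sum((e - mean_err) ** 2 for e in error_rates) / k  # spread across trials
    return dbias, dbias ** 2 + dvar                          # MSDE = bias^2 + variance


# Hypothetical classifier: 0.1% discriminant bias in every one of ten trials.
dbias, msde = approx_dbias_msde([0.050] * 10, bayes_error=0.049)
print("~DBias:", dbias)
print("~MSDE: ", msde)
```

With no variation across trials, the ~MSDE reduces to the squared ~DBias, which is why a ~DBias threshold of $10^{-3}$ corresponds to an ~MSDE threshold of $10^{-6}$.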

Efron poses (and answers) the rhetorical question, "why use the partially-parametric model if the
fully-parametric model is more efficient?" The reason is that the fully-parametric model is proper if and only
if x has homoscedastic Gaussian class-conditional pdfs; the partially-parametric model remains proper for
a broader set of exponentially-distributed feature vectors (e.g., [83]). We extend the rhetorical question one
more  level:  why  use  the  differentially-generated  parametric  model  if  the  probabilistically-generated  models 
are  more  efficient?  The  reason  is  that  the  fully-  and  partially-parametric  models  are  proper  for  x  only  so  long 
as  it  has  a  specific  probabilistic  form;  if  the  probabilistic  nature  of  x  deviates  from  this  form,  the  parametric 
models  are  no  longer  proper.  Under  these  circumstances,  differential  learning  will  still  generate  the  most 
efficient  classifier  allowed  by  the  parametric  model  for  asymptotically  large  training  sample  sizes,  whereas 
the  probabilistic  learning  strategies  will  generate  decidedly  inefficient  classifiers  from  the  model  for  both 
small  and  large  training  sample  sizes.  We  analyze  an  improper  scenario  in  the  following  section  in  order  to 
illustrate  this  point. 


4.3  Analysis  of  an  Improper  Parametric  Model 


Figure 4.9 illustrates a three-class scalar x with heteroscedastic⁹ uniform class-conditional pdfs for the
three classes ($\omega_1, \omega_2, \omega_3$). There are two class boundaries ($B_{1,2\,Bayes} = -4.0$, $B_{2,3\,Bayes} = 4.0$) for the
Bayes-optimal classifier of x. The class-conditional pdfs of x are given by


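$$
\begin{aligned}
\rho_{x|\mathcal{W}}(x \mid \omega_1) &= \tfrac{1}{2}\,[\,U(x + 5.8) - U(x + 3.8)\,] \\
\rho_{x|\mathcal{W}}(x \mid \omega_2) &= \tfrac{1}{8}\,[\,U(x + 4) - U(x - 4)\,] \\
\rho_{x|\mathcal{W}}(x \mid \omega_3) &= \tfrac{1}{2}\,[\,U(x - 3.8) - U(x - 5.8)\,] ,
\end{aligned} \tag{4.28}
$$

where U(·) denotes the Heaviside step function. The class prior probabilities are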


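$$
\begin{aligned}
P_{\mathcal{W}}(\omega_1) &= P_{\mathcal{W}}(\omega_3) = 0.1 \\
P_{\mathcal{W}}(\omega_2) &= 0.8 ,
\end{aligned} \tag{4.29}
$$

and the a posteriori class probabilities of x are given by

$$
\begin{aligned}
P_{\mathcal{W}|x}(\omega_1 \mid x) &= U(x + 5.8) - \tfrac{2}{3}\,U(x + 4) - \tfrac{1}{3}\,U(x + 3.8) \\
P_{\mathcal{W}|x}(\omega_2 \mid x) &= \tfrac{2}{3}\,U(x + 4) + \tfrac{1}{3}\,U(x + 3.8) - \tfrac{1}{3}\,U(x - 3.8) - \tfrac{2}{3}\,U(x - 4) \\
P_{\mathcal{W}|x}(\omega_3 \mid x) &= \tfrac{1}{3}\,U(x - 3.8) + \tfrac{2}{3}\,U(x - 4) - U(x - 5.8)
\end{aligned} \tag{4.30}
$$

⁹Heteroscedastic pdfs have different variance parameters (or covariance matrices).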

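Thus, the Bayes error rate is 2.0%, given the following classification strategy:

$$
\begin{aligned}
x < B_{1,2\,Bayes} &: \ \text{choose } \omega_1 \\
B_{1,2\,Bayes} \le x \le B_{2,3\,Bayes} &: \ \text{choose } \omega_2 \\
x > B_{2,3\,Bayes} &: \ \text{choose } \omega_3
\end{aligned} \tag{4.31}
$$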
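The 2.0% figure can be checked numerically. The following is a minimal sketch, assuming the uniform class-conditional pdfs of (4.28), the priors of (4.29), and the boundaries at -4.0 and 4.0; the function names and the midpoint-rule grid are illustrative choices, not part of the original experiments.

```python
import numpy as np

# Priors from (4.29) and supports of the uniform pdfs from (4.28).
PRIORS = np.array([0.1, 0.8, 0.1])
SUPPORTS = [(-5.8, -3.8), (-4.0, 4.0), (3.8, 5.8)]

def joint_densities(x):
    """Return a (3, len(x)) array of rho(x|w_i) * P(w_i)."""
    x = np.asarray(x, dtype=float)
    rows = []
    for prior, (lo, hi) in zip(PRIORS, SUPPORTS):
        inside = (x >= lo) & (x < hi)
        rows.append(prior * inside / (hi - lo))
    return np.vstack(rows)

def bayes_decision(x, b12=-4.0, b23=4.0):
    """0-based class index chosen by the strategy of (4.31)."""
    x = np.asarray(x, dtype=float)
    return np.where(x < b12, 0, np.where(x > b23, 2, 1))

# Midpoint rule on a fine grid covering the support of x.
h = 1e-3
x = np.arange(-5.8, 5.8, h) + h / 2
joint = joint_densities(x)
chosen = bayes_decision(x)
correct = joint[chosen, np.arange(x.size)]   # joint density of the chosen class
bayes_error = float(np.sum(joint.sum(axis=0) - correct) * h)
print(f"Bayes error rate: {bayes_error:.4f}")
```

The only misclassified probability mass lies in the two overlap regions of width 0.2, where the joint density of the losing class is 0.05.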

4.3.1 The Improper Parametric Model

We  learn  to  classify  x  with  a  3-output  discriminator  that  has  polynomial  discriminant  functions  of  the  form 

$$
\mathbf{g}(x \mid \Theta) = \Big\{\, y_i = g_i(x \mid \Theta) = \sum_{k=0}^{K_i} \theta_{i,k} \cdot (x)^k \ ;\ \ i = 1, \ldots, C = 3 \,\Big\} , \tag{4.32}
$$

where $K_i$ represents the order of the polynomial expression for the ith discriminant function (again, we use
the notation $(x)^k$ to denote the kth power of x, as opposed to $x^k$, which denotes the kth example of x). As
described in section 2.2.1, we interpret the discriminator output with the largest value as the classifier’s vote
for the class of its scalar input x. This polynomial classifier is depicted in figure 4.10 and it is generated with
a modified form of the backpropagation algorithm (e.g., [119, 120]). It is, by definitions 3.13 and 3.14, an
improper parametric model of x because the discriminant functions of (4.32) are not under any circumstances
identically equal to the a posteriori class probabilities of (4.30).

Given our interpretation of the classifier’s outputs in section 2.2.1, it is clear that if $y_1$ and $y_3$ are linear
functions of x and $y_2$ is a constant, the resulting classifier depicted by the white nodes and black connections
in figure 4.10 has the minimum functional complexity necessary to learn the Bayes-optimal classifier of
x (the gray nodes and connections depict more complex hypothesis classes, equating to higher order
polynomial expressions in (4.32), for this task).¹⁰

¹⁰Reference [52] incorrectly states that the minimum-complexity polynomial classifier has two linear discriminant functions and one
quadratic discriminant function. As described herein, the third discriminant function need only be a constant for the classifier to yield
Bayesian discrimination.


Figure 4.9: A three-class scalar feature discrimination task. The single feature is a heteroscedastic, uniformly-
distributed random variable. From top to bottom: the class-conditional density - class prior products
$\rho_{x|\mathcal{W}}(x \mid \omega_i) \cdot P_{\mathcal{W}}(\omega_i)$; the pdf of x, $\rho_x(x)$; the a posteriori probability of class $\omega_2$, $P_{\mathcal{W}|x}(\omega_2 \mid x)$; the a
posteriori probabilities of classes $\omega_1$ and $\omega_3$, $P_{\mathcal{W}|x}(\omega_1 \mid x)$, $P_{\mathcal{W}|x}(\omega_3 \mid x)$.

Figure 4.10: The polynomial classifier of x depicted as a neural network paradigm. Hidden layer nodes
compute powers of x; output nodes are linear combinations of these powers; cf. (4.32). The polynomial
classifier with the minimum complexity necessary for Bayesian discrimination of x is indicated by the white
nodes and black “connections” (i.e., parameters); the minimum-complexity parameters are labeled.


Minimum-, low-, and high-complexity classifiers: For the purpose of this illustration, $K_i$ (the
order of the ith polynomial in (4.32)) may be taken as the complexity measure for the ith
discriminant function $g_i(x \mid \Theta)$. We generate classifiers from three polynomial hypothesis classes of
increasing complexity. As described above, the minimum-complexity hypothesis class has discriminant
functions of order $K_1 = 1$, $K_2 = 0$, and $K_3 = 1$; we often use the notation “1-0-1” to denote
this hypothesis class. Our choice of low-complexity hypothesis class has discriminant functions of
order $K_1 = 1$, $K_2 = 2$, and $K_3 = 1$; we often use the notation “1-2-1” to denote this hypothesis
class. Our choice of high-complexity hypothesis class has discriminant functions of order $K_1 = 10$,
$K_2 = 10$, and $K_3 = 10$; we often use the notation “10-10-10” to denote this hypothesis class.
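As a minimal sketch of how a discriminator of the form (4.32) casts its vote, the fragment below evaluates the three polynomial discriminant functions and returns the index of the largest output. The function names and the "1-0-1" parameter values are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def discriminants(x, theta):
    """Evaluate g_i(x|theta) = sum_k theta[i][k] * x**k for each class i, as in (4.32)."""
    x = np.asarray(x, dtype=float)
    return np.vstack([P.polyval(x, np.asarray(c, dtype=float)) for c in theta])

def classify(x, theta):
    """0-based index of the largest discriminator output for each input."""
    return np.argmax(discriminants(x, theta), axis=0)

# Hypothetical "1-0-1" parameters: two linear functions and one constant.
theta_101 = [[-1.0, -0.5], [1.0], [-1.0, 0.5]]
print(classify([-5.0, 0.0, 5.0], theta_101))
```

The length of each coefficient list fixes the order $K_i$, so the "1-2-1" and "10-10-10" hypothesis classes are obtained by passing longer lists for the relevant outputs.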


4.3.2  Probabilistic  Learning  via  MSE  for  the  Asymptotically  Large  Training  Sample 


Probabilistic  learning  is  implemented  by  minimizing  a  measure  of  the  difference  between  the  discriminator 
output  vector  Y  and  a  corresponding  target  vector  denoting  the  class  of  the  training  example  (see  section  2.3); 
the  minimization  is  done  for  all  examples  in  the  training  sample,  and  generally  takes  the  form  of  an  iterative 
search procedure. We employ backpropagation, a well-known probabilistic learning paradigm; its iterative
search procedure is gradient descent, and the gradient of the classifier's MSE with respect to the parameter
vector $\Theta$ is computed by the chain-rule [119, 120].¹¹

¹¹Backpropagation generally employs MSE, although other objective functions can be used. We employ only the MSE objective
function for probabilistic learning. The CE objective function, for example, cannot be used because the polynomial classifier's outputs
are unbounded; this violates the conditions necessary for using CE (see section 2..X.2). When paired with the CFM objective function
and a gradient ascent search, backpropagation constitutes a differential learning strategy.
Again, we denote the training sample of size n by $S^n$, and we denote a particular unique value (or
pattern) of x by $x_p$. If there are P unique patterns in $S^n$, and for each of these patterns there are $n_{p,i}$
examples belonging to class $\omega_i$ (out of $n_p$ examples of the pattern in all), the sample MSE is given by

$$
\mathrm{MSE}(S^n \mid \Theta) = \sum_{i=1}^{C=3} \sum_{p=1}^{P} \Big[ \big(g_i(x_p \mid \Theta) - 1\big)^2 \cdot \frac{n_{p,i}}{n} + \big(g_i(x_p \mid \Theta)\big)^2 \cdot \frac{n_p - n_{p,i}}{n} \Big] ; \qquad \sum_{p=1}^{P} n_p = n \tag{4.33}
$$


From  section  2.3.2  we  know  that  the  classifier’s  MSE  can  be  expressed  by  the  following  expectation  as  the 
training  sample  size  grows  asymptotically  large: 


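$$
E_x[\mathrm{MSE}(x \mid \Theta)] = \int \sum_{i=1}^{C=3} \Big[ \big(g_i(x \mid \Theta) - 1\big)^2 \cdot P_{\mathcal{W}|x}(\omega_i \mid x) + \big(g_i(x \mid \Theta)\big)^2 \cdot P_{\mathcal{W}|x}(\neg\omega_i \mid x) \Big]\, \rho_x(x)\, dx \tag{4.34}
$$

where

$$
P_{\mathcal{W}|x}(\neg\omega_i \mid x) = 1 - P_{\mathcal{W}|x}(\omega_i \mid x) \tag{4.35}
$$

The parameterization $\Theta^*$ that minimizes the classifier’s MSE for the asymptotically large training sample
size can be found by substituting the discriminant function expressions of (4.32) into (4.35), deriving the
expression for the gradient $\nabla_\Theta \left( E_x[\mathrm{MSE}(x \mid \Theta^*)] \right)$, setting this gradient equal to the zero vector, and
solving the resulting normal equations (see section 2.3.2).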

Barnard and Casasent use this technique for deriving the minimum-MSE parameterization of a linear
classifier, given a 2-class Gaussian feature [6]. Appendix H derives distribution-independent expressions
for the asymptotic minimum-MSE parameterization of the ith discriminant function $g_i(x \mid \Theta)$ in (4.32);
expressions are given for constant, linear, and quadratic discriminant functions (i.e., for $K_i = 0, 1, 2$).
Distribution-independent expressions for the minimum-MSE parameterizations of higher-order polynomial
discriminant functions become cumbersome, so we derive the minimum-MSE parameterization of the high-
complexity classifier (i.e., the MSE-generated “10-10-10” model) in distribution-dependent form using
(4.28), (4.30), (4.32), and (4.35). Table 4.1 summarizes the results of appendix H; it lists the minimum-
MSE parameterizations of the minimum-, low-, and high-complexity classifiers, given an asymptotically
large training sample size.
Asymptotic Minimum-MSE Parameterizations (n → ∞)

Minimum-Complexity Classifier “1-0-1”

g_1(x|Θ)                        g_2(x|Θ)                      g_3(x|Θ)
θ_{1,0} = [illegible]           θ_{2,0} = 0.8                 θ_{3,0} = [illegible]
θ_{1,1} = [illegible]           θ_{2,1} = 0                   θ_{3,1} = 0.0536833

Low-Complexity Classifier “1-2-1”

g_1(x|Θ)                        g_2(x|Θ)                      g_3(x|Θ)
θ_{1,0} = [illegible]           θ_{2,0} = 1.13764             θ_{3,0} = 0.1
θ_{1,1} = [illegible]           θ_{2,1} = [illegible]         θ_{3,1} = 0.0536833
θ_{1,2} = 0                     θ_{2,2} = [illegible]         θ_{3,2} = 0

High-Complexity Classifier “10-10-10”

g_1(x|Θ)                        g_2(x|Θ)                      g_3(x|Θ)
θ_{1,0} = -0.0222177            θ_{2,0} = 1.04444             θ_{3,0} = -0.0222177
θ_{1,1} = -0.0535861            θ_{2,1} = 0                   θ_{3,1} = 0.0535861
θ_{1,2} = 0.0513984             θ_{2,2} = -0.102797           θ_{3,2} = 0.0513984
θ_{1,3} = 0.0273086             θ_{2,3} = 0                   θ_{3,3} = -0.0273086
θ_{1,4} = -0.0172172            θ_{2,4} = 0.0344344           θ_{3,4} = -0.0172172
θ_{1,5} = -0.00319705           θ_{2,5} = 0                   θ_{3,5} = 0.00319705
θ_{1,6} = 0.00179515            θ_{2,6} = -0.00359029         θ_{3,6} = 0.00179515
θ_{1,7} = 0.000110813           θ_{2,7} = 0                   θ_{3,7} = -0.000110813
θ_{1,8} = -0.0000664743         θ_{2,8} = 0.000132949         θ_{3,8} = -0.0000664743
θ_{1,9} = -0.00000120399        θ_{2,9} = 0                   θ_{3,9} = 0.00000120399
θ_{1,10} = 0.000000815606       θ_{2,10} = -0.00000163121     θ_{3,10} = 0.000000815606

Table 4.1: The minimum-MSE parameterizations for the minimum-, low-, and high-complexity polynomial
classifiers of x when the training sample size n is asymptotically large (i.e., n → ∞).
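The low-order entries of table 4.1 can be cross-checked numerically. The sketch below assumes the uniform pdfs of (4.28) and the priors of (4.29), and solves the normal equations that follow from setting the gradient of the asymptotic MSE to zero: for a polynomial $g_i$ of order K, $\sum_k \theta_{i,k}\, E[x^{j+k}] = P(\omega_i)\, E[x^j \mid \omega_i]$ for j = 0, ..., K. The helper names are illustrative and the closed-form moments stand in for the derivations of appendix H, which is not reproduced here.

```python
import numpy as np

PRIORS = np.array([0.1, 0.8, 0.1])                    # (4.29)
SUPPORTS = [(-5.8, -3.8), (-4.0, 4.0), (3.8, 5.8)]    # supports of (4.28)

def uniform_moment(k, lo, hi):
    """E[x**k] for x uniformly distributed on [lo, hi]."""
    return (hi ** (k + 1) - lo ** (k + 1)) / ((k + 1) * (hi - lo))

def class_moment(k, i):
    """E[x**k * P(w_i|x)] = P(w_i) * E[x**k | w_i]."""
    return PRIORS[i] * uniform_moment(k, *SUPPORTS[i])

def moment(k):
    """E[x**k] under the mixture pdf of x."""
    return sum(class_moment(k, i) for i in range(3))

def min_mse_theta(i, order):
    """Solve the normal equations for the polynomial g_i of the given order."""
    m = np.array([[moment(j + k) for k in range(order + 1)]
                  for j in range(order + 1)])
    r = np.array([class_moment(j, i) for j in range(order + 1)])
    return np.linalg.solve(m, r)

print(min_mse_theta(1, 0))   # constant g_2; table 4.1 lists theta_{2,0} = 0.8
print(min_mse_theta(2, 1))   # linear g_3; table 4.1 lists theta_{3,1} = 0.0536833
print(min_mse_theta(1, 2))   # quadratic g_2; table 4.1 lists theta_{2,0} = 1.13764
```

The same routine accepts order 10, but the moment matrix is then badly conditioned in double precision, so this sketch should not be relied on to reproduce the high-complexity column.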


4.3.3 Differential Learning via CFM for the Asymptotically Large Training Sample

Differential  learning  is  implemented  for  the  improper  parametric  model  in  the  same  way  it  is  implemented 
for  the  partially-parametric  proper  model  of  section  4.2.3.  The  minimum-complexity  polynomial  classifier 
maximizes  CFM,  satisfies  (4.20),  and  constitutes  the  Bayes-optimal  classifier  when  its  parameters  satisfy  the 

following constraints:

$$
\begin{aligned}
\theta_{1,1\,CFM} &< 0 \\
\theta_{3,1\,CFM} &> 0 \\
\theta_{1,1\,CFM} \cdot B_{1,2\,Bayes} + \theta_{1,0\,CFM} - \theta_{2,0\,CFM} &= 0 \\
\theta_{3,1\,CFM} \cdot B_{2,3\,Bayes} + \theta_{3,0\,CFM} - \theta_{2,0\,CFM} &= 0
\end{aligned} \tag{4.36}
$$

When these conditions are satisfied, the resulting class boundaries equal the Bayes-optimal boundaries:

$$
\begin{aligned}
B_{1,2\,CFM} &= B_{1,2\,Bayes} \\
B_{2,3\,CFM} &= B_{2,3\,Bayes}
\end{aligned} \tag{4.37}
$$


4.3.4  Results  of  Differential  and  Probabilistic  Learning  for  Asymptotically  Large 
and  Small  Training  Samples 

Regardless of the strategy used to determine the classifier’s parameters, the resulting class boundaries occur
at all x for which more than one discriminant function is maximal. For the minimum-complexity polynomial
classifier, this occurs at

$$
\begin{aligned}
B_{1,2} &= \frac{\theta_{2,0} - \theta_{1,0}}{\theta_{1,1}} \\
B_{2,3} &= \frac{\theta_{2,0} - \theta_{3,0}}{\theta_{3,1}}
\end{aligned} \tag{4.38}
$$
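A short numerical illustration of (4.38) follows. It is a sketch under stated assumptions: the minimum-MSE values use $\theta_{2,0} = 0.8$ and $\theta_{3,1} = 0.0536833$ from table 4.1, with $\theta_{1,0} = \theta_{3,0} = 0.1$ and $\theta_{1,1} = -\theta_{3,1}$ assumed here from the mirror symmetry of the task; the second parameter set is hypothetical and was chosen only because it satisfies the constraints of (4.36).

```python
def boundaries_101(theta1, theta2, theta3):
    """Class boundaries of a "1-0-1" classifier, following (4.38)."""
    b12 = (theta2[0] - theta1[0]) / theta1[1]
    b23 = (theta2[0] - theta3[0]) / theta3[1]
    return b12, b23

# Minimum-MSE parameters (partly assumed by symmetry, see the lead-in).
mse_bounds = boundaries_101((0.1, -0.0536833), (0.8,), (0.1, 0.0536833))

# Hypothetical parameters satisfying (4.36) for Bayes boundaries at -4 and 4.
cfm_bounds = boundaries_101((-1.0, -0.5), (1.0,), (-1.0, 0.5))

print("MSE boundaries:", mse_bounds)
print("CFM boundaries:", cfm_bounds)
```

Under these assumptions the minimum-MSE boundaries fall outside the support of x, which extends only from -5.8 to 5.8, so the constant discriminant function is maximal everywhere on that support.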


Figure 4.11 illustrates the discriminant functions of several polynomial classifiers that have learned to
recognize the three classes that x represents, given an asymptotically large training sample. Both the top and
bottom figures show the discriminant functions superimposed in color on the gray a posteriori class probabilities
$P_{\mathcal{W}|x}(\omega_1 \mid x)$, $P_{\mathcal{W}|x}(\omega_2 \mid x)$, $P_{\mathcal{W}|x}(\omega_3 \mid x)$. There is a bar-graph display associated with each classifier
underneath the discriminant functions. The bar-graph shows how its associated classifier partitions feature
space.¹² The Bayes-optimal classifier’s partitioning is always shown in gray for reference. The top figure
shows two minimum-complexity classifiers: one generated probabilistically via the MSE objective function
(red, short-dashed lines), and one generated differentially via the CFM objective function (solid green
lines). Because the probabilistically generated classifier is attempting to approximate $P_{\mathcal{W}|x}(\omega_1 \mid x)$ and
$P_{\mathcal{W}|x}(\omega_3 \mid x)$ with linear functions of x and $P_{\mathcal{W}|x}(\omega_2 \mid x)$ with a constant, the minimum-MSE discriminant
functions are as shown (their parameter values are given in table 4.1, top), and the resulting classifier labels
all examples of x as $\omega_2$. As a result, the classifier misclassifies all examples of $\omega_1$ and $\omega_3$; its error rate
is therefore 20%.¹³ The differentially-generated classifier shown is one set of an infinite number of possible
maximum-CFM discriminant functions. Its parameterization is such that $g_i(x \mid \Theta)$ is always maximum on the


¹²Note that the legend “CFM 1-0-1”, for example, denotes the differentially generated, minimum-complexity polynomial classifier.

¹³The MSE-generated minimum-complexity classifier would exhibit an error rate much closer to the Bayes-optimal rate of 2% for
large training sample sizes if the linear discriminant functions for class $\omega_1$ and $\omega_3$ were replaced by logistic discriminant functions.
This is because the resulting hypothesis class would be a substantially better approximation to the proper parametric model of x. This
extends to a general argument for hypothesis classes with a logistic functional basis; many real-world feature vectors have unimodal
class-conditional pdfs, so their a posteriori class probabilities are reasonably well modeled by a logistic functional basis. This accounts
for the success and wide-spread use of probabilistically-generated logistic regression models and multi-layer perceptrons, a subject
we address further in chapter 11. Of course, our choice of the functional basis is intentionally malicious here: we wish to illustrate the
disadvantages of probabilistic learning when the parametric model is improper.

sub-domain of x for which $\omega_i$ is the most likely class. As a result, the classifier satisfies (4.20) and exhibits
the Bayes error rate of 2%.

It is clear from figure 4.11 (top) that the minimum-complexity classifier has insufficient functional
complexity to learn the Bayes-optimal classifier probabilistically (i.e., to approximate the a posteriori class
probabilities of x). Figure 4.11 (bottom) illustrates what is required to do this. If we increase the complexity
of $g_2(x \mid \Theta)$ by making it a quadratic function of x, the resulting minimum-MSE discriminant functions are
shown in short-dashed red lines (their parameter values are given in table 4.1, middle). This low-complexity
classifier has enough functional complexity to classify some examples of $\omega_1$ and $\omega_3$ correctly, although it
still lacks sufficient complexity for Bayesian discrimination. Its error rate is 7.8%. The differentially generated
low-complexity classifier (solid and shaded green lines), like its minimum-complexity counterpart,
yields the Bayes error rate of 2%. Again, there are innumerable maximum-CFM parameterizations for the
differentially-generated classifier; the green shaded lines in the figure depict several of these. Finally, we
increase the complexity of the probabilistically generated classifier so that all three discriminant functions are
10th-order polynomials in x. These discriminant functions are shown by the blue dashed lines in the lower
figure; only the MSE-generated classifier is shown (its parameter values are given in table 4.1, bottom).
This high-complexity classifier has sufficient functional complexity to approximate the a posteriori class
probabilities of x reasonably well when generated via the MSE objective function; it exhibits a 2.2% error
rate, nominally the Bayes error rate.
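The asymptotic error rates quoted above for the minimum- and low-complexity classifiers can be reproduced with a few lines of code. This is a sketch under stated assumptions: it takes the pdfs of (4.28) and priors of (4.29), obtains the minimum-MSE parameters from the normal equations rather than from appendix H, evaluates the error by a midpoint rule, and uses a hypothetical maximum-CFM parameterization whose boundaries sit at -4 and 4. All names are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P

PRIORS = np.array([0.1, 0.8, 0.1])
SUPPORTS = [(-5.8, -3.8), (-4.0, 4.0), (3.8, 5.8)]

def class_moment(k, i):
    """P(w_i) * E[x**k | w_i] for the uniform pdfs of (4.28)."""
    lo, hi = SUPPORTS[i]
    return PRIORS[i] * (hi ** (k + 1) - lo ** (k + 1)) / ((k + 1) * (hi - lo))

def min_mse_theta(i, order):
    """Asymptotic minimum-MSE coefficients of g_i via the normal equations."""
    moment = lambda k: sum(class_moment(k, c) for c in range(3))
    m = np.array([[moment(j + k) for k in range(order + 1)]
                  for j in range(order + 1)])
    r = np.array([class_moment(j, i) for j in range(order + 1)])
    return np.linalg.solve(m, r)

def error_rate(theta, h=1e-3):
    """Probability of error of the argmax classifier defined by theta."""
    x = np.arange(-5.8, 5.8, h) + h / 2
    joint = np.vstack([p * ((x >= lo) & (x < hi)) / (hi - lo)
                       for p, (lo, hi) in zip(PRIORS, SUPPORTS)])
    g = np.vstack([P.polyval(x, c) for c in theta])
    chosen = np.argmax(g, axis=0)
    correct = joint[chosen, np.arange(x.size)]
    return float(np.sum(joint.sum(axis=0) - correct) * h)

mse_101 = [min_mse_theta(i, k) for i, k in enumerate((1, 0, 1))]
mse_121 = [min_mse_theta(i, k) for i, k in enumerate((1, 2, 1))]
# Hypothetical maximum-CFM "1-0-1" parameters (boundaries at -4 and 4).
cfm_101 = [np.array([-1.0, -0.5]), np.array([1.0]), np.array([-1.0, 0.5])]

for name, theta in [("MSE 1-0-1", mse_101), ("MSE 1-2-1", mse_121), ("CFM 1-0-1", cfm_101)]:
    print(f"{name}: {100 * error_rate(theta):.1f}% error")
```

In this sketch the low-complexity MSE boundaries land near ±4.58, so part of the $\omega_1$ and $\omega_3$ probability mass between 3.8 and 4.58 in magnitude is assigned to $\omega_2$.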

Figure 4.11 illustrates that differential learning requires the minimum-complexity polynomial classifier
necessary for Bayesian discrimination. The minimum-complexity requirements of differential learning hold
for any and all choices of hypothesis class, as proven in section 3.5. Probabilistic learning, in contrast,
requires a high-complexity polynomial classifier in order to approximate the Bayes-optimal classifier, a
result that is representative of the generally excessive complexity requirements of probabilistic learning, the
single notable exception being when the hypothesis class is a proper parametric model.

So far we have considered the asymptotic case in which we have an unlimited number of training
examples. It is more realistic to consider the case in which we have a limited amount of training data.
Figure 4.12 depicts the same classifiers shown in figure 4.11 with one difference: the classifiers in figure 4.12
have been generated with a single training sample containing only n = 100 examples of x (the different
classifiers all learn the same 100 examples). The minimum-complexity classifiers (top) behave in much the
same way as they do for the asymptotically large training sample. The probabilistically generated classifier
(red, short-dashed lines) misclassifies all examples of $\omega_1$ and $\omega_3$. Owing to the small sample size, the
empirical a posteriori class probabilities of x are crude approximations to the true probabilities. As a result,
the differentially-generated classifier’s partitioning of feature space (solid green lines) deviates slightly from
the Bayes-optimal partitioning, and its error rate is 3.4%. The low-complexity classifiers (bottom) also
behave in much the same way as they do for the asymptotically large training sample. The probabilistically
Figure 4.11: Discriminant functions of probabilistically (MSE) and differentially (CFM) generated polynomial
classifiers of x for an asymptotically large training sample size (i.e., n → ∞). The functions are shown
superimposed on their associated a posteriori class probabilities (shown in gray). Each of the bar-graphs
underneath the discriminant functions depicts how its associated polynomial classifier partitions feature
space. Top: the minimum-complexity classifier (“1-0-1”) having one constant and two linear discriminant
functions. Bottom: a low complexity classifier (“1-2-1”) having one quadratic and two linear discriminant
functions, and a high-complexity classifier (“10-10-10”) having three 10th-order polynomial discriminant
functions (MSE-generated only). Numerous low-complexity CFM-maximizing classifiers are shown (green
shaded lines) in order to emphasize that there are innumerable optimal solutions when differential learning is
employed.


Figure 4.12: Discriminant functions of probabilistically (MSE) and differentially (CFM) generated polynomial
classifiers of x for a typical training sample of size n = 100. Again, the functions are shown superimposed
on their associated a posteriori class probabilities (shown in gray), and each of the bar-graphs underneath
the discriminant functions depicts how its associated polynomial classifier partitions feature space. Top:
the minimum-complexity classifier (“1-0-1”) having one constant and two linear discriminant functions.
Bottom: a low complexity classifier (“1-2-1”) having one quadratic and two linear discriminant functions,
and a high-complexity classifier (“10-10-10”) having three 10th-order polynomial discriminant functions
(MSE-generated only).

generated classifier (red, short-dashed lines) exhibits an error rate of 7.8%, and the differentially-generated
classifier (solid green lines) exhibits an error rate of 3.3%.

Recall  that  the  high-complexity  classifier  exhibits  a  2.2%  error  rate  for  the  asymptotically  large  training 
sample. Since the empirical a posteriori class probabilities of x are crude approximations to the true
probabilities  when  n  =  100,  there  are  regions  on  the  domain  of  x  where  no  training  examples  occur.  The 
classifier’s  discriminant  functions  are  unconstrained  in  these  regions  during  learning.  Figure  4.12  (bottom) 
illustrates  what  happens  as  a  result.  The  high-complexity  classifier's  discriminant  functions  (blue  dashed 
lines)  are  unconstrained  for  values  of  x  above  ~  5.2  and  below  ~  —5.0  because  the  training  sample  contains 
no  examples  beyond  these  limits.  As  a  result,  the  discriminant  function  for  is  maximal  for  jr  <  ~  —5.0 , 
the  discriminant  function  for  LO\  is  maximal  for  ~  5.2  <  x  <~  5.7,  and  the  discriminant  function  for 
LOi  is  maximal  for  x  >  ~  5.7 .  The  resulting  partitioning  of  feature  space  (bottom  blue  bar-graph)  is  poor, 
and the classifier exhibits a 7.8% error rate. This is a classic expression of Occam's razor [130, 21], in which
the  classifier  has  so  much  functional  complexity  it  fails  to  generalize  well  for  small  training  sample  sizes. 

Figure  4.13  displays  the  empirical  distribution  of  the  error  rates  for  minimum-,  low-,  and  high-complexity 
polynomial  classifiers  of  x.  Results  for  differential  learning  via  CFM  are  shown  in  white  box  plots;  results 
for  probabilistic  learning  via  MSE  are  shown  in  gray  box  plots.  As  with  the  proper  parametric  model 
experiments,  all  results  for  finite  training  sample  sizes  are  based  on  10  independent  trials  for  the  specified 
training  sample  size  (all  classifiers  learn  the  same  training  sample  in  a  given  trial,  for  a  given  sample  size). 
Learning  takes  the  form  of  a  steepest  descent  (MSE)  or  steepest  ascent  (CFM)  search  over  parameter  space, 
using a modified form of the backpropagation algorithm (e.g., [119, 120]). Learning begins from a tabula
rasa state in which all parameters are initialized randomly according to a uniform distribution on the closed
interval [−0.3, 0.3]. All trials are completely automated, so learning is done without any human intervention.
All  experimental  conditions  (except,  of  course,  for  the  objective  function  used)  are  identical  for  differential 
and  probabilistic  learning.  Classifier  parameterizations  for  the  asymptotically  large  training  sample  size  are 
derived  as  described  in  section  4.3.2,  appendix  H,  and  section  4.3.3. 
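The probabilistic (MSE) half of this procedure can be sketched as follows. This is an illustration under stated assumptions, not the modified backpropagation algorithm used in the experiments: it draws one sample of n = 100 from the mixture defined by (4.28) and (4.29), initializes each parameter uniformly on [−0.3, 0.3], and runs plain batch gradient descent on the sample MSE of a "1-0-1" classifier. The seed, step size, and iteration count are arbitrary choices; the CFM branch is omitted because its functional form is not given in this section.

```python
import numpy as np

rng = np.random.default_rng(0)               # arbitrary seed
PRIORS = [0.1, 0.8, 0.1]
SUPPORTS = [(-5.8, -3.8), (-4.0, 4.0), (3.8, 5.8)]

def draw_sample(n):
    """Draw n labeled examples of x from the three-class mixture."""
    labels = rng.choice(3, size=n, p=PRIORS)
    lo = np.array([SUPPORTS[c][0] for c in labels])
    hi = np.array([SUPPORTS[c][1] for c in labels])
    return rng.uniform(lo, hi), labels

def fit_mse_gd(x, target, order, lr=0.02, steps=5000):
    """Steepest descent on the sample MSE of one polynomial discriminant."""
    phi = np.vander(x, order + 1, increasing=True)     # columns 1, x, x**2, ...
    theta = rng.uniform(-0.3, 0.3, size=order + 1)      # tabula rasa start
    for _ in range(steps):
        grad = 2.0 * phi.T @ (phi @ theta - target) / x.size
        theta -= lr * grad
    return theta

x, labels = draw_sample(100)
thetas = [fit_mse_gd(x, (labels == i).astype(float), k)
          for i, k in enumerate((1, 0, 1))]
for i, th in enumerate(thetas, start=1):
    print(f"g_{i} coefficients:", th)
```

Because each output's MSE is quadratic in its own coefficients, the descent converges to the least-squares solution for that output, which makes the result easy to check against a direct solver.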

Figure 4.14 plots sample-average approximations of each classifier’s mean-squared discriminant error
(~MSDE); these values correspond to the box plot statistics in figure 4.13 (the box plots display the results
in quartiles, whereas ~MSDE is based on sample averages; this accounts for slight differences between the
two figures). The minimum-complexity differentially generated classifier is the most efficient, exhibiting
consistently low error rates for small training sample sizes. Based on chapter 6, we predict that 1121 samples
of x are necessary to guarantee (with 95% confidence) an error rate of no more than 4.0% using differential
learning. Note that the empirical upper bound on the differentially generated minimum-complexity classifier's
error rate is 3.3% when the sample size is 1000. Increasing the differentially generated classifier’s complexity
increases its empirical discriminant variance, according to Occam’s razor (i.e., excessively complex models
are anathema).
[Figure 4.13 plot; legend: Differential Learning (CFM), Probabilistic Learning (MSE); panels: Minimum Complexity Classifier (1-0-1), Low Complexity Classifier (1-2-1), High Complexity Classifier (10-10-10); horizontal axis: Training Sample Size]

Figure  4.13:  A  comparison  of  error  rates  for  differentially  (CFM)  and  probabilistically  (MSE)  generated 
polynomial  classifiers.  Results  for  the  differentially  generated  classifiers  are  shown  in  white;  those  for  the 
probabilistically  generated  classifier  are  shown  in  gray.  Left:  the  minimum-complexity  classifier  having  one 
constant  and  two  linear  discriminant  functions;  Middle:  a  low  complexity  classifier  having  one  quadratic  and 
two linear discriminant functions; Right: a high-complexity classifier having three 10th-order polynomial
discriminant  functions. 


[Figure 4.14 plot; legend: Differential Learning (CFM), Probabilistic Learning (MSE)]

Figure 4.14: A comparison of the approximated mean-squared discriminant error (~MSDE) for differentially
(CFM) and probabilistically (MSE) generated polynomial classifiers. These statistics are based on the same
data used to generate the box plots in figure 4.13. Results for the differentially generated classifiers are
shown in solid lines; those for the probabilistically generated classifier are shown in dashed lines. The gray
background depicts values of ~MSDE for which we consider the classifier to be a good approximation to
the Bayes-optimal classifier. Left: the minimum-complexity classifier having one constant and two linear
discriminant functions; Middle: a low complexity classifier having one quadratic and two linear discriminant
functions; Right: a high-complexity classifier having three 10th-order polynomial discriminant functions.
Owing to the inefficiency of probabilistic learning, none of the MSE-generated classifiers is relatively
efficient for any training sample size.

The inefficiency of probabilistic learning is clear in figures 4.13 and 4.14. The minimum-complexity
classifier has no discriminant variance, but its approximate discriminant bias is 20% − 2% = 18%. The
low-complexity classifier has substantially lower approximate discriminant bias (8.3% − 2% = 6.3% for
n = 1000), but its discriminant variance is high (2.2 × 10^[illegible]). The high-complexity classifier has moderate
discriminant bias (5.4% − 2% = 3.4% for n = 1000), and moderate discriminant variance (1.3 × 10^[illegible]),
substantially better than the probabilistically generated low-complexity classifier, but substantially worse
than the differentially generated minimum-complexity classifier.

The gray background of figure 4.14 denotes values of ~MSDE for which we consider a classifier to be
a good approximation to the Bayes-optimal classifier. Specifically, if the classifier's ~MSDE is less than
$10^{-6}$, we consider it a good approximation.¹⁴ All of the differentially-generated polynomial classifiers are
asymptotically good approximations to the Bayes-optimal classifier, whereas none of the probabilistically-
generated classifiers are. Moreover, the minimum- and low-complexity differentially-generated classifiers are
between one and two orders of magnitude more efficient than their probabilistically-generated counterparts.
As the model complexity becomes high, both differentially and probabilistically generated classifiers are
inefficient models of the data for small training sample sizes, a clear expression of Occam’s razor.
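The arithmetic behind these comparisons can be written out in a few lines. The sketch assumes that the approximate discriminant bias is the error rate minus the Bayes error rate (as in the 20% − 2% = 18% example) and that ~MSDE is the squared bias plus the discriminant variance; the second assumption is inferred from the statement that a bias of 0.1% with zero variance gives an ~MSDE of $10^{-6}$, and the function names are illustrative.

```python
BAYES_ERROR = 0.02   # Bayes error rate of the three-class task

def approx_dbias(error_rate, bayes_error=BAYES_ERROR):
    """Approximate discriminant bias: error rate in excess of the Bayes rate."""
    return error_rate - bayes_error

def approx_msde(dbias, dvar):
    """Assumed decomposition: squared discriminant bias plus discriminant variance."""
    return dbias ** 2 + dvar

def is_good_approximation(msde, threshold=1e-6):
    """Threshold used for the gray background of figure 4.14."""
    return msde < threshold

# Minimum-complexity MSE-generated classifier: 20% error, no variance.
bias = approx_dbias(0.20)
print(bias, approx_msde(bias, 0.0), is_good_approximation(approx_msde(bias, 0.0)))
```

Under this reading, a classifier with zero variance meets the threshold only if its error rate is within 0.1% of the Bayes rate.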

4.4  Summary 

The “toy” experiments of this chapter illustrate the theoretical proofs of chapters 2 and 3. Differential learning
is asymptotically efficient, regardless of the hypothesis class (i.e., parametric model) employed, whereas
probabilistic learning is efficient (for both large and small training sample sizes) only if the hypothesis class
is a proper parametric model of the data. This implies a kind of robust beauty in the differentially-generated
classifier; it is guaranteed to be the best approximation of the Bayes-optimal classifier allowed by the model
of the data, so long as the training sample size is sufficiently large. As we stated in chapter 3, we know of no
other learning strategy that can make this guarantee.

There is no doubt that probabilistic learning in the form of maximum-likelihood parameter estimation¹⁵
is the most efficient learning strategy if the parametric model is indeed a good approximation to the proper
one. Our contention is that this is not always the case. If the parametric model is simple, traditional statistical
hypothesis testing procedures (e.g., see [140]) can verify whether or not it is proper. If the model is complex,
complexity theory argues against its being proper, particularly when the training sample size is small. This
leads us to conclude that differential learning is the best choice of learning strategy if the model is likely to
be improper. The experiments of part II consistently show this to be a valid conclusion.


•^Again, a classifier with a discriminant bias of 0.1% and a discriminant variance of zero exhibits an MSDE of $10^{-6}$, so this is a
rigorous standard for good approximation.

'  *  We  remind  the  reader  that  maximum-likelihood  parameter  estimation  generally  equates  to  probabilistic  learning  via  an  error  measure 
objective  function. 


Chapter  5 


Properties of the CFM Objective Function^1

Outline 

We examine the relationship between the objective function's monotonicity and the efficiency of the learning
strategy it implements: the objective function must be monotonic for the learning strategy to be efficient.^2
This chapter's proofs that the CFM objective function is monotonic for the general C-class pattern recognition
task parallel the chapter 3 proofs that differential learning is asymptotically efficient. Likewise, the proofs that
error measures are non-monotonic for the general C-class pattern recognition task parallel the chapter 3 proofs
that probabilistic learning is inefficient. Moreover, probabilistic learning becomes increasingly inefficient as
the number of classes C increases, owing to the increasingly non-monotonic nature of error measures. We
develop a simple taxonomy of training examples in order to show that differential learning via CFM focuses
on un-learned examples. Among these, there are easy and hard examples. We explain why easy examples
can be learned with high confidence, whereas hard examples must be learned with low confidence. We
conclude by examining the specific functional characteristics of CFM in order to motivate the synthetic form
we employ. We prove that differential learning via the synthetic form of CFM remains both efficient and
reasonably fast as learning confidence is reduced. In contrast, differential learning via the original functional
forms of CFM [55] is unreasonably slow and/or inefficient.

5.1  Introduction 

Differentiable  supervised  classifiers  that  learn  iteratively  employ  an  objective  function  (or  empirical  risk 
measure)  that  evaluates  how  well  the  classifier  has  learned  to  classify  all  the  examples  of  the  training  sample. 
A monotonic objective function is one that is always a strictly decreasing (or increasing) function of the
classifier's empirical training sample error rate. Chapter 2 establishes a link between families of objective
functions and the learning strategies they implement: error measures engender probabilistic learning, whereas
the CFM objective function engenders differential learning. Section 3.3 indirectly proves that maximizing
the CFM objective function minimizes the classifier's empirical training sample error rate (the proof is
part of the larger proof that differential learning via CFM is asymptotically efficient). In other words,
CFM is a monotonic objective function. Section 3.4 proves that minimizing the general error measure
does not minimize the classifier's empirical training sample error rate. In other words, error measures
are non-monotonic objective functions; they engender inefficient learning (unless the hypothesis class with
which they are paired is the proper parametric model of the feature vector).

^1 Portions of this chapter were first published in (3.S].

^2 An objective function is monotonic if and only if it is either a strictly increasing or a strictly decreasing function of the classifier's
empirical training sample error rate (see definition 5.10).

In  this  chapter,  we  take  a  geometric  view  of  the  discriminator’s  output  state  in  order  to  illustrate  the 
monotonic  nature  of  the  CFM  objective  function  and  the  non-monotonic  nature  of  error  measures.  We 
demonstrate  that  probabilistic  learning  strategies  become  increasingly  inefficient  as  the  number  of  classes 
C in the pattern recognition task increases, owing to the non-monotonic nature of error measures. In the
process, we develop a simple taxonomy of training examples. Fundamentally, each training example falls
into one of two categories: learned and (as yet) un-learned. The un-learned examples are either easy to learn
or hard to learn (terms we define in section 5.4). We show that differential learning via the synthetic CFM
objective function focuses on the un-learned examples; we explain why easy examples can be learned with
high  confidence,  whereas  hard  examples  must  be  learned  with  low  confidence. 

We  conclude  by  analyzing  the  functional  characteristics  of  CFM  in  order  to  motivate  the  synthetic  form 
we  employ.  The  analysis  focuses  on  the  process  of  learning  hard  examples  with  necessarily  low  confidence. 
Since  the  differentiable  supervised  classifier  learns  by  searching  over  parameter  space,  the  speed  of  the 
learning  procedure  (i.e.,  its  convergence  rate)  is  proportional  to  the  step  size  of  the  search  procedure.  We 
prove  that  differential  learning  via  the  synthetic  form  of  CFM  is  reasonably  fast  for  both  easy  and  hard 
examples;  convergence  to  the  CFM-maximizing  parameters  proceeds  at  a  rate  that  decreases  polynomially 
with respect to the synthetic CFM confidence parameter $\psi$. In contrast, we prove that differential learning
via the original forms of CFM [55] is inefficient and/or unreasonably slow. In the latter case, convergence
to  the  CFM-maximizing  parameters  proceeds  at  a  rate  that  decreases  exponentially  with  respect  to  the  CFM 
confidence  parameter. 


5.2  Discriminator  Output  Space 

We precede our discussion of monotonicity with a number of definitions that follow from a geometric view
of discriminator output space $\mathcal{Y}$. Recall from section 2.2.1 that we generally assume discriminator output
space is infinite and uncountable for the C-class discriminator (i.e., $\mathcal{Y} = \mathbb{R}^C$). In this section we will assume
that each discriminator output is uncountable on a closed interval with lower and upper bounds $l$ and $h$:


$$\mathbf{Y} \in \mathcal{Y} = [l, h]^C \tag{5.1}$$

When $l \to -\infty$ and $h \to \infty$, (5.1) is equivalent to $\mathcal{Y} = \mathbb{R}^C$. We therefore use the bounds $l$ and $h$
without loss of generality.

Some  of  the  following  definitions  might  seem  rather  abstract  and  tedious  at  first  reading.  We  encourage 
the  reader  to  persevere,  reading  the  definitions  first  without  dwelling  on  the  associated  mathematical  details, 
which  are  simply  more  formal  expressions  of  what  we  say  in  words.  Those  who  want  to  work  through  the 
mathematical  details  can  go  back  through  the  definitions  a  second  time.  The  concepts  are  all  geometric  and 
rather  simple,  which  should  become  apparent  as  one  proceeds  through  the  first  reading  of  the  definitions. 
Examples  and  figures  help  to  clarify  the  concepts. 

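Definition 5.1 The discriminant continuum: Consider the classifier with the discriminator output
space $\mathcal{Y}$ defined by (5.1), given the jth example $\mathbf{X}^j$ as its input. The example's class label is $\mathcal{W}^j$. The
classifier's discriminant continuum, given $(\mathbf{X}^j, \mathcal{W}^j)$, is an imaginary line drawn between two particular,
opposite vertices of $\mathcal{Y}$. The “incorrect” vertex of $\mathcal{Y}$ is the point $\mathbf{Y}_{\text{incorrect}}$ at which the discriminator output
associated with the class $\mathcal{W}^j$ has a minimum value; all other discriminator outputs have a maximum value:^3

$$\mathbf{Y}_{\text{incorrect}} = (y_1, \ldots, y_C) : \quad y_i = \begin{cases} l, & \mathcal{W}^j = \omega_i \\ h, & \text{otherwise} \end{cases} \tag{5.2}$$

The opposite “correct” vertex of $\mathcal{Y}$ is the point $\mathbf{Y}_{\text{correct}}$ at which the discriminator output associated with
the class $\mathcal{W}^j$ has a maximum value; all other discriminator outputs have a minimum value:

$$\mathbf{Y}_{\text{correct}} = (y_1, \ldots, y_C) : \quad y_i = \begin{cases} h, & \mathcal{W}^j = \omega_i \\ l, & \text{otherwise} \end{cases} \tag{5.3}$$

The discriminant continuum is the line between $\mathbf{Y}_{\text{incorrect}}$ and $\mathbf{Y}_{\text{correct}}$:

$$\{\mathbf{Y} : \mathbf{Y} = \alpha \cdot \mathbf{Y}_{\text{correct}} + (1 - \alpha) \cdot \mathbf{Y}_{\text{incorrect}}\} ; \quad 0 \le \alpha \le 1 \tag{5.4}$$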


Remark: Note that the discriminant continuum is a notion that is tied to specific examples of the random
feature vector: each of these examples has an associated class label $\mathcal{W}^j$, and this class label determines the
specific mathematical expression for the discriminant continuum via (5.2) through (5.4). Definitions 5.2 through 5.9 are
tied to specific examples of the random feature vector in precisely the same manner.
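As a concrete illustration of definition 5.1, the following minimal Python sketch builds the two vertices of (5.2) and (5.3) and evaluates a point on the continuum of (5.4). The function names, the 0-based class index, and the default bounds l = 0 and h = 1 are illustrative assumptions of ours rather than notation from the text.

```python
def vertices(num_classes, label, low=0.0, high=1.0):
    """Return (Y_incorrect, Y_correct) for the class with 0-based index `label`.

    Y_incorrect: the labelled output is at its minimum, all others at maximum.
    Y_correct:   the labelled output is at its maximum, all others at minimum.
    """
    y_incorrect = [low if i == label else high for i in range(num_classes)]
    y_correct = [high if i == label else low for i in range(num_classes)]
    return y_incorrect, y_correct


def continuum_point(alpha, y_incorrect, y_correct):
    """Point alpha * Y_correct + (1 - alpha) * Y_incorrect, with 0 <= alpha <= 1."""
    return [alpha * c + (1.0 - alpha) * w
            for c, w in zip(y_correct, y_incorrect)]


# Hypothetical 3-class task; the example's class label is the second class.
y_bad, y_good = vertices(num_classes=3, label=1)
midpoint = continuum_point(0.5, y_bad, y_good)
```

Setting alpha to 0 recovers the incorrect vertex and setting it to 1 recovers the correct vertex, so the continuum spans the range from the worst incorrect classification to the best correct one.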

^3 Recall from section 2.2.4 that we use the notation $y_\tau$ to denote the discriminator output associated with the class $\mathcal{W}^j = \omega_\tau$; we
use the notation $y_{\bar{\tau}}$ to denote the largest other discriminator output. We remind the reader that we rely on these notational conventions
throughout the text.
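Definition 5.2 Reduced discriminator output space: Consider the classifier with the discriminator
output space $\mathcal{Y}$ defined by (5.1), given the jth example $\mathbf{X}^j$ as its input. The example's class label is $\mathcal{W}^j$. If
we re-express discriminator output space thus

$$\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_C ; \quad \mathcal{Y}_i = [l, h], \tag{5.5}$$

reduced discriminator output space, given $(\mathbf{X}^j, \mathcal{W}^j)$, is the 2-dimensional sub-space of $\mathcal{Y}$ comprising
the domain $\mathcal{Y}_\tau$ of the discriminator output $y_\tau$ (corresponding to $\mathcal{W}^j = \omega_\tau$) and the domain $\mathcal{Y}_{\bar{\tau}}$ of
the discriminator's largest other output $y_{\bar{\tau}}$. Mathematically, reduced discriminator output space for the
example/class label pair $(\mathbf{X}^j, \mathcal{W}^j)$ is given by

$$\mathcal{Y}_\tau \times \mathcal{Y}_{\bar{\tau}} : \quad \mathcal{W}^j = \omega_\tau, \ y_\tau \in \mathcal{Y}_\tau, \ y_{\bar{\tau}} \in \mathcal{Y}_{\bar{\tau}}, \ y_{\bar{\tau}} = \max_{k \ne \tau} y_k \tag{5.6}$$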


Example 5.1 Figure 5.1 illustrates reduced discriminator output space for a hypothetical classifier with C
discriminator outputs that take on values between zero and one (i.e., $\mathcal{Y} = [l = 0, h = 1]^C$). Three
training examples are projected onto the space as gray dots. Each example $\mathbf{X}^j$ elicits an output state
$\mathcal{G}(\mathbf{X}^j \mid \boldsymbol{\theta}) = \{g_1(\mathbf{X}^j \mid \boldsymbol{\theta}), \ldots, g_C(\mathbf{X}^j \mid \boldsymbol{\theta})\} = \{y_1, \ldots, y_C\}$ in the discriminator. Given the jth training
example, the position of the dot along the horizontal axis denotes the value of the discriminator output
$y_\tau = g_\tau(\mathbf{X}^j \mid \boldsymbol{\theta})$, which corresponds to the example's class label $\mathcal{W}^j = \omega_\tau$; the position of the dot along
the vertical axis denotes the value of the largest other discriminator output $y_{\bar{\tau}} = \max_{k \ne \tau} g_k(\mathbf{X}^j \mid \boldsymbol{\theta})$.
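The projection described in definition 5.2 and example 5.1 can be sketched in a few lines of Python. The function name and the sample output vector below are hypothetical; the class label is taken to be a 0-based index into the output list.

```python
def reduce_output_state(outputs, label):
    """Project a C-dimensional output state onto reduced output space.

    Returns (y_tau, y_tau_bar): the output for the example's class label
    and the largest of the remaining outputs.
    """
    y_tau = outputs[label]
    y_tau_bar = max(y for k, y in enumerate(outputs) if k != label)
    return y_tau, y_tau_bar


# Hypothetical 4-class output state; the example's class label is index 1.
state = [0.2, 0.7, 0.4, 0.1]
reduced = reduce_output_state(state, label=1)
```

Whatever the number of classes C, the reduced state is always a pair, which is why a single 2-dimensional plot suffices to picture the classification of any example.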


Definition 5.3 The reduced discriminant continuum: The reduced discriminant continuum is
the projection of the discriminant continuum (definition 5.1) onto reduced discriminator output space
(definition 5.2). Consider the classifier with the discriminator output space $\mathcal{Y}$ defined by (5.1). Given the jth
example/class label pair $(\mathbf{X}^j, \mathcal{W}^j)$, the discriminator outputs $y_\tau$ and $y_{\bar{\tau}}$ and their corresponding domains
$\mathcal{Y}_\tau$ and $\mathcal{Y}_{\bar{\tau}}$ are determined by the example's class label $\mathcal{W}^j = \omega_\tau$ and the discriminator output state
$\mathcal{G}(\mathbf{X}^j \mid \boldsymbol{\theta})$. The reduced discriminant continuum is the line between 1) the point in reduced discriminator
output space for which $y_\tau$ takes on its minimum value and $y_{\bar{\tau}}$ takes on its maximum value, and 2) the
point in reduced discriminator output space for which $y_\tau$ takes on its maximum value and $y_{\bar{\tau}}$ takes on its
minimum value. In vector notation, the reduced discriminant continuum is given by

$$\{(y_\tau, y_{\bar{\tau}}) : (y_\tau, y_{\bar{\tau}}) = \alpha \cdot (h, l) + (1 - \alpha) \cdot (l, h)\} ; \quad \mathcal{W}^j = \omega_\tau, \ 0 \le \alpha \le 1 \tag{5.7}$$
Remark: We often abuse terminology by using the terms (reduced) discriminant continuum and (reduced)
discriminator  output  space  synonymously.  We  assume  that  the  reader  understands  that  the  two  concepts  are 
inextricably  linked,  so  each  term  implies  the  other. 

Example 5.2 Figure 5.1 illustrates the reduced discriminant continuum on the reduced discriminator output
space of our hypothetical classifier. It is the line between the point $(y_\tau = 0, y_{\bar{\tau}} = 1)$ and the point
$(y_\tau = 1, y_{\bar{\tau}} = 0)$. The line is offset by dashed lines for clarity, and it is labelled “Discriminant Continuum”
rather than “Reduced Discriminant Continuum” for the sake of simplicity.

Remark: Intuitively, the discriminant continuum and its reduced counterpart represent a line between the
worst possible incorrect classification and the best possible correct classification of an example. Our use
of the terms “worst” and “best” is quantitative in the following sense: the worst possible incorrect
classification occurs when $\mathbf{Y} = \mathbf{Y}_{\text{incorrect}}$ (the discriminator output corresponding to the example's class
label is minimum and all the other outputs, corresponding to incorrect classifications of the example, are
maximum); the best possible correct classification occurs when $\mathbf{Y} = \mathbf{Y}_{\text{correct}}$ (the discriminator output
corresponding to the example's class label is maximum and all the other outputs, corresponding to incorrect
classifications of the example, are minimum).

Definition 5.4 The discriminant boundary: Consider the classifier with the discriminator output
space $\mathcal{Y}$ defined by (5.1), given the jth example $\mathbf{X}^j$ as its input. The example's class label is $\mathcal{W}^j$. The
discriminant boundary is the set of all discriminator output states, given $(\mathbf{X}^j, \mathcal{W}^j)$, for which the output
$y_\tau$ (corresponding to the example's class label $\mathcal{W}^j = \omega_\tau$) is equal to the largest other output $y_{\bar{\tau}}$ and
greater than or equal to all other discriminator outputs:

$$\{\mathbf{Y} : y_\tau = y_{\bar{\tau}} \ \cap \ y_\tau \ge y_k \ \forall\, k \ne \tau\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.8}$$


Definition 5.5 The reduced discriminant boundary: The reduced discriminant boundary is the
projection of the discriminant boundary onto reduced discriminator output space:

$$\{(y_\tau, y_{\bar{\tau}}) : y_\tau = y_{\bar{\tau}}\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.9}$$


Example 5.3 Figure 5.1 illustrates the reduced discriminant boundary on the reduced discriminator output
space of our hypothetical classifier. It is the line between the point $(y_\tau = 0, y_{\bar{\tau}} = 0)$ and the point
$(y_\tau = 1, y_{\bar{\tau}} = 1)$. The line is labelled “Discriminant Boundary” rather than “Reduced Discriminant
Boundary” for the sake of simplicity.
Definition 5.6 The “incorrect” side of discriminator output space $\mathcal{Y}_{\text{incorrect}}$: Consider the classifier
with the discriminator output space $\mathcal{Y}$ defined by (5.1), given the jth example $\mathbf{X}^j$ as its input. The example's
class label is $\mathcal{W}^j$. The “incorrect” side of discriminator output space, given $(\mathbf{X}^j, \mathcal{W}^j)$, is the set of all
discriminator output states for which the output $y_\tau$ (corresponding to the example's class label $\mathcal{W}^j = \omega_\tau$)
is not maximal. Mathematically, the incorrect side of discriminator output space is given by

$$\mathcal{Y}_{\text{incorrect}} = \{\mathbf{Y} : y_\tau \le y_k \ \text{for some} \ k \ne \tau\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.10}$$

Definition  5.7  The  “incorrect”  side  of  reduced  discriminator  output  space:  The  incorrect  side  of 
reduced  discriminator  output  space  is  the  projection  of  the  incorrect  side  of  discriminator  output  space  onto 
reduced  discriminator  output  space: 

$$\{(y_\tau, y_{\bar{\tau}}) : y_\tau \le y_{\bar{\tau}}\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.11}$$


Incorrect space: We sometimes use the term “incorrect space” to denote the incorrect side of discriminator
(and  reduced  discriminator)  output  space. 

Example 5.4 Figure 5.1 illustrates the incorrect side of reduced discriminator output space for our hypothetical
classifier. It is the region above and to the left of the reduced discriminant boundary.

Definition 5.8 The “correct” side of discriminator output space $\mathcal{Y}_{\text{correct}}$: Consider the classifier
with the discriminator output space $\mathcal{Y}$ defined by (5.1), given the jth example $\mathbf{X}^j$ as its input. The example's
class label is $\mathcal{W}^j$. The “correct” side of discriminator output space, given $(\mathbf{X}^j, \mathcal{W}^j)$, is the set of all
discriminator output states for which the output $y_\tau$ (corresponding to the example's class label $\mathcal{W}^j = \omega_\tau$)
is greater than all other outputs. Mathematically, the correct side of discriminator output space is given by

$$\mathcal{Y}_{\text{correct}} = \{\mathbf{Y} : y_\tau > y_k \ \forall\, k \ne \tau\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.12}$$

Definition  5.9  The  “correct”  side  of  reduced  discriminator  output  space:  The  correct  side  of  reduced 
discriminator  output  space  is  the  projection  of  the  correct  side  of  discriminator  output  space  onto  reduced 
discriminator  output  space: 


$$\{(y_\tau, y_{\bar{\tau}}) : y_\tau > y_{\bar{\tau}}\} ; \quad \mathcal{W}^j = \omega_\tau \tag{5.13}$$


Figure 5.1: Reduced discriminator output space for a hypothetical classifier with C outputs that take on
values between zero and one. The reduced space is 2-dimensional: the abscissa corresponds to $y_\tau$, the
discriminator output corresponding to the class label of the example that the classifier is processing; the
ordinate corresponds to $y_{\bar{\tau}}$, the largest other discriminator output. The discriminator output states generated
by three different hypothetical examples are projected onto this space. Examples 1 and 3 are correctly
classified since they generate a discriminator output state in which the output associated with the example's
class is larger than all other outputs (i.e., $y_\tau > y_{\bar{\tau}}$). Example 2 is incorrectly classified since it generates a
discriminator output state in which the output associated with the example's class is smaller than at least one
other output (i.e., $y_\tau < y_{\bar{\tau}}$).


Correct space: We sometimes use the term “correct space” to denote the correct side of discriminator
(and  reduced  discriminator)  output  space. 


Example 5.5 Figure 5.1 illustrates the correct side of reduced discriminator output space for our hypothetical
classifier.  It  is  the  region  below  and  to  the  right  of  the  reduced  discriminant  boundary. 
5.2.1 The Discriminant Differential $\delta_\tau$, the Reduced Discriminant Continuum, and
the Reduced Discriminant Boundary

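Recall from (2.22) that the discriminant differential $\delta_\tau$ for the example/class label pair $(\mathbf{X}^j, \mathcal{W}^j)$ is given
by

$$\delta_\tau(\mathbf{X}^j \mid \boldsymbol{\theta}) = g_\tau(\mathbf{X}^j \mid \boldsymbol{\theta}) - \max_{k \ne \tau} g_k(\mathbf{X}^j \mid \boldsymbol{\theta}) ; \quad \mathcal{W}^j = \omega_\tau \tag{5.14}$$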


Note that the set of all reduced discriminator output states corresponding to a specific value of $\delta_\tau$ is given by

$$\{(y_\tau, y_{\bar{\tau}}) : y_\tau = y_{\bar{\tau}} + \delta_\tau\} ; \quad \mathcal{W}^j = \omega_\tau, \tag{5.15}$$

which, but for a constant, is identical to the expression for the reduced discriminant boundary in (5.9). Thus,
all examples that generate the same discriminant differential lie on a line that is parallel to the reduced
discriminant boundary in reduced discriminator output space.

Indeed, $\delta_\tau$ is, to within a constant factor of $\sqrt{2}$, the signed Euclidean distance between the classifier's reduced
discriminator output state and the reduced discriminant boundary; equivalently, it is proportional to the
projection of the classifier's reduced discriminator output state onto the reduced discriminant continuum.
Figure 5.2 illustrates these relationships for the
reduced discriminator output space and examples shown in figure 5.1. The lower left part of the figure shows
the domain of $\delta_\tau$: the diagonal gray lines at intervals of 0.2 project up onto reduced discriminator output
space and the reduced discriminant continuum. Note that these lines are parallel to the reduced discriminant
boundary.


Positive discriminant differentials $\delta_\tau$ indicate correct classifications: Positive values of $\delta_\tau$ correspond
to correct classifications (i.e., classifier output states that lie in correct space). Non-positive values
of $\delta_\tau$ correspond to incorrect classifications (i.e., classifier output states that lie in incorrect space).


Since the classifier represented in figure 5.2 has discriminator outputs that are bounded on [0,1], $\delta_\tau$ is
bounded on [-1,1].
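The relationships in this subsection can be checked with a short Python sketch. It computes the discriminant differential of (5.14) from an output state, uses its sign to decide whether the example is correctly classified, and evaluates the signed Euclidean distance from a reduced output state to the line on which the two reduced outputs are equal (this distance works out to the differential divided by the square root of two). The function names and sample values are hypothetical.

```python
import math


def discriminant_differential(outputs, label):
    """delta_tau: labelled output minus the largest other output."""
    others = [y for k, y in enumerate(outputs) if k != label]
    return outputs[label] - max(others)


def is_correct(outputs, label):
    """An example is correctly classified only when delta_tau is positive."""
    return discriminant_differential(outputs, label) > 0.0


def signed_distance_to_boundary(y_tau, y_tau_bar):
    """Signed Euclidean distance from (y_tau, y_tau_bar) to the line y_tau = y_tau_bar."""
    return (y_tau - y_tau_bar) / math.sqrt(2.0)


outputs = [0.2, 0.7, 0.4]           # hypothetical outputs bounded on [0, 1]
delta = discriminant_differential(outputs, label=1)

# Two reduced states with the same differential are equally far from the
# boundary, i.e. they lie on a common line parallel to it.
d1 = signed_distance_to_boundary(0.7, 0.4)
d2 = signed_distance_to_boundary(0.5, 0.2)
```

Because every output lies in [0, 1], the differential computed this way can never leave [-1, 1].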

The lower left of figure 5.2 also shows $\sigma[\delta_\tau, \psi]$, the CFM of $\delta_\tau$, given a confidence parameter value
of 0.4. Since CFM is a strictly non-decreasing function of $\delta_\tau$ and all reduced discriminator output states
generating a specific value of $\delta_\tau$ lie on a line that is parallel to the reduced discriminant boundary, the CFM
objective function has contours of constant value that are parallel to the reduced discriminant boundary.
Figure 5.2: An illustration of the discriminant differential $\delta_\tau$ and its relationship to reduced discriminator
output space. Examples that generate the same discriminant differential lie on a line that is parallel to the
reduced discriminant boundary. Since CFM (i.e., $\sigma[\delta_\tau, \psi]$) is a strictly non-decreasing function of $\delta_\tau$, the
contours of constant CFM lie parallel to the reduced discriminant boundary, a necessary characteristic of
the monotonic objective function.
5.3 Objective Function Monotonicity and Learning Efficiency

The  objective  function  must  be  monotonic  if  it  is  to  engender  efficient  learning  regardless  of  the  choice  of 
hypothesis  class. 

Definition  5.10  A  monotonic  objective  function:  A  monotonic  objective  function  is  always  a  strictly 
decreasing (or increasing) function of the classifier's empirical training sample error rate.

Remark:  If  the  monotonic  objective  function  is  a  strictly  decreasing  function  of  the  classifier’s  empirical 
training  sample  error  rate  (e.g.,  CFM),  then  learning  is  accomplished  by  maximizing  the  objective  function 
with  respect  to  the  discriminator  parameters,  given  the  training  sample.  If  the  monotonic  objective  function 
is  a  strictly  increasing  function  of  the  classifier’s  empirical  training  sample  error  rate  (e.g.,  one  minus 
CFM),  then  learning  is  accomplished  by  minimizing  the  objective  function  with  respect  to  the  discriminator 
parameters,  given  the  training  sample. 

A necessary condition for monotonicity: We use the notation $\Phi(\mathbf{Y})$ to denote the value of the objective
function, given the discriminator output state $\mathbf{Y}$. In order for $\Phi$ to be monotonic on $\mathcal{Y}$, it must exhibit a
more optimal value for every discriminator output state in correct space than it exhibits for any state in
incorrect space. Mathematically, $\Phi$ must satisfy

$$\Phi(\mathbf{Y}) \ \text{is more optimal than} \ \Phi(\mathbf{Y}') \quad \forall\, (\mathbf{Y}, \mathbf{Y}') ; \ \mathbf{Y} \in \mathcal{Y}_{\text{correct}}, \ \mathbf{Y}' \in \mathcal{Y}_{\text{incorrect}} \tag{5.16}$$

in order to be monotonic on $\mathcal{Y}$. We stress that (5.16) is a necessary condition for monotonicity, but it is not
sufficient. The sufficient conditions for true monotonicity are discussed in section 5.3.6.


Clearly, (5.16) must hold if the objective function is monotonic; otherwise some discriminator output states
in correct space will generate less optimal values of the objective function than other output states in incorrect
space. If $\Phi$ fails to satisfy (5.16) it is surely non-monotonic.
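A single counterexample is enough to show that an objective function violates the necessary condition (5.16). The Python sketch below assumes a two-class task with outputs on [0, 1] and a squared-error objective whose targets are 1 for the labelled output and 0 for the other; this target convention is our assumption about the form of the error measure. As a stand-in for a differential objective it uses the discriminant differential itself, which is not the CFM functional form but shares the property of being non-decreasing in the differential. All names and sample states are hypothetical.

```python
def squared_error(y_tau, y_tau_bar):
    """Squared error against assumed targets (1, 0); smaller is more optimal."""
    return (y_tau - 1.0) ** 2 + (y_tau_bar - 0.0) ** 2


def differential(y_tau, y_tau_bar):
    """Stand-in differential objective; larger is more optimal."""
    return y_tau - y_tau_bar


correct_state = (0.10, 0.00)     # differential +0.10: correctly classified
incorrect_state = (0.45, 0.50)   # differential -0.05: incorrectly classified

# Squared error scores the incorrect state as MORE optimal (smaller error),
# so it cannot satisfy the necessary condition for monotonicity.
mse_prefers_incorrect = (
    squared_error(*incorrect_state) < squared_error(*correct_state)
)

# The differential-based stand-in ranks the correct state higher, as required.
delta_prefers_correct = (
    differential(*correct_state) > differential(*incorrect_state)
)
```

The two sample states sit on opposite sides of the discriminant boundary, yet the incorrect one is closer to the squared-error target; this is the geometric reason the contours of an error measure cannot be parallel to the boundary.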

Definition 5.11 A non-monotonic objective function: A non-monotonic objective function is not always
a strictly decreasing (or increasing) function of the classifier's empirical training sample error rate.

Error  measures  are  non-monotonic  because  their  contours  of  constant  value  are  not  parallel  to  the 
discriminant  boundary.  They  become  increasingly  non-monotonic  —  and  probabilistic  learning  becomes 
increasingly  inefficient  —  as  the  number  of  classes  C  increases.  In  order  to  prove  this,  we  need  to  define 
some  sub-sets  of  discriminator  output  space,  define  some  associated  measures,  and  present  a  few  lemmas. 
We  provide  examples  in  support  of  the  definitions.  These  examples  are  associated  with  a  classifier  having 
C = 2 discriminant functions on the space $\mathcal{Y} = [l = 0, h = 1]^2$; the classifier learns probabilistically
via the MSE objective function, and is illustrated in figure 5.3. Contours of constant MSE are projected onto
discriminator output space, forming the concentric circular arcs in the figure.

Cardinality: We denote the cardinality of a set (i.e., the number of elements in the set) by $|\cdot|$. When the set
is an uncountable space, as $\mathcal{Y}$ is in (5.1), we express cardinalities as volumes. Note that

$$|\mathcal{Y}| = (h - l)^C, \tag{5.17}$$

given $\mathcal{Y}$ in (5.1).


Definition 5.12 Correct fraction of discriminator output space $\mathcal{CF}$: We denote the correct fraction of
discriminator output space by $\mathcal{CF}$. If we think of discriminator output space and its associated sub-spaces as
sets of points, $\mathcal{CF}$ is the ratio of two cardinalities associated with two sets: the numerator is the cardinality
of the set of all discriminator output states in correct space $\mathcal{Y}_{\text{correct}}$; the denominator is the cardinality of
discriminator output space $\mathcal{Y}$.

$$\mathcal{CF} \triangleq \frac{|\mathcal{Y}_{\text{correct}}|}{|\mathcal{Y}|} \tag{5.18}$$


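Lemma 5.1 The correct fraction of discriminator output space $\mathcal{Y}_{\text{correct}}$ decreases as $C^{-1}$ for all $C \ge 2$.
Proof: Given the expression for correct space in (5.12), its cardinality (i.e., volume) is given by

$$|\mathcal{Y}_{\text{correct}}| = \int_l^h \underbrace{\int_l^{y_\tau} \cdots \int_l^{y_\tau}}_{C - 1 \ \text{integral terms}} d\alpha_1 \cdots d\alpha_{C-1} \, dy_\tau \tag{5.19}$$

where $\alpha_1, \ldots, \alpha_{C-1}$ are simply the $C - 1$ dummy variables of integration for the discriminator outputs not
associated with $y_\tau$. By (5.1) and (5.17) through (5.19), $\mathcal{CF}$ is therefore

$$\mathcal{CF} = \frac{1}{(h - l)^C} \int_l^h (y_\tau - l)^{C-1} \, dy_\tau = \frac{1}{C} \qquad \blacksquare \tag{5.20}$$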

Example 5.6 For the C = 2-class task depicted in figure 5.3, correct space comprises one half of
discriminator output space (i.e., $\mathcal{CF} = \frac{1}{C} = \frac{1}{2}$).
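Lemma 5.1 lends itself to a quick numerical check. The Python sketch below draws output states uniformly from the unit hypercube (assuming l = 0 and h = 1, which does not affect the fraction) and counts how often the labelled output exceeds all the others. By symmetry the first output is treated as the labelled one. The function name, sample size, and seed are arbitrary choices of ours.

```python
import random


def correct_fraction_estimate(num_classes, num_samples=100_000, seed=0):
    """Monte Carlo estimate of the fraction of output space that is 'correct'."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_samples):
        outputs = [rng.random() for _ in range(num_classes)]
        # Treat index 0 as the example's class label (any index would do).
        if outputs[0] > max(outputs[1:]):
            hits += 1
    return hits / num_samples


# Estimates for several class counts; each should sit close to 1 / C.
estimates = {c: correct_fraction_estimate(c) for c in (2, 3, 5, 10)}
```

The estimates shrink in step with 1/C, which illustrates why an ever smaller share of discriminator output space corresponds to a correct classification as the number of classes grows.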
Definition 5.13 Incorrect fraction of discriminator output space $\mathcal{IF}$: We denote the incorrect
fraction of discriminator output space by $\mathcal{IF}$. It is the ratio of two cardinalities: the numerator is the
cardinality of the set of all discriminator output states in incorrect space $\mathcal{Y}_{\text{incorrect}}$; the denominator is the
cardinality of discriminator output space $\mathcal{Y}$.

$$\mathcal{IF} \triangleq \frac{|\mathcal{Y}_{\text{incorrect}}|}{|\mathcal{Y}|} \tag{5.21}$$

or, equivalently,

$$\mathcal{IF} = 1 - \mathcal{CF} \tag{5.22}$$


Lemma 5.2 The incorrect fraction of discriminator output space $\mathcal{Y}_{\text{incorrect}}$ increases as $1 - C^{-1}$ for all $C \ge 2$.
Proof: The proof follows immediately from (5.20) and (5.22). $\blacksquare$

Example 5.7 For the C = 2-class task depicted in figure 5.3, incorrect space comprises one half of
discriminator output space (i.e., $\mathcal{IF} = 1 - \frac{1}{C} = \frac{1}{2}$).

Definition 5.14 Non-monotonic correct fraction of discriminator output space $\mathcal{CF}_{\neg\text{mono}}$: We denote
the non-monotonic correct fraction of discriminator output space by $\mathcal{CF}_{\neg\text{mono}}$. It is the ratio of two
cardinalities: the numerator is the cardinality of the set of all discriminator output states in correct space
$\mathcal{Y}_{\text{correct}}$ for which there is at least one discriminator output state in incorrect space $\mathcal{Y}_{\text{incorrect}}$ that generates
a more optimal objective function value; the denominator is the cardinality of discriminator output space.

$$\mathcal{CF}_{\neg\text{mono}} \triangleq \frac{|\{\mathbf{Y} : \mathbf{Y} \in \mathcal{Y}_{\text{correct}} \ \cap \ \exists\, \mathbf{Y}' \in \mathcal{Y}_{\text{incorrect}} \ \text{s.t.} \ \Phi(\mathbf{Y}) \ \text{is less optimal than} \ \Phi(\mathbf{Y}')\}|}{|\mathcal{Y}|} \tag{5.23}$$


Example 5.8 For the C = 2-class task depicted in figure 5.3, the non-monotonic region of correct space is
the light gray shaded region below and to the right of the discriminant boundary. The fraction of discriminator
output space that this region encompasses is $\mathcal{CF}_{\neg\text{mono}}$: in this particular case, $\mathcal{CF}_{\neg\text{mono}} \cong 0.107$.

Definition 5.15 Monotonic correct fraction of discriminator output space $\mathcal{CF}_{\text{mono}}$: We denote the
monotonic correct fraction of discriminator output space by $\mathcal{CF}_{\text{mono}}$. It is the ratio of two cardinalities: the
numerator is the cardinality of the set of all discriminator output states in correct space $\mathcal{Y}_{\text{correct}}$ for which
all discriminator output states in incorrect space $\mathcal{Y}_{\text{incorrect}}$ generate less optimal objective function values;
the denominator is the cardinality of discriminator output space.

$$\mathcal{CF}_{\text{mono}} \triangleq \frac{|\{\mathbf{Y} : \mathbf{Y} \in \mathcal{Y}_{\text{correct}} \ \cap \ \Phi(\mathbf{Y}) \ \text{is more optimal than} \ \Phi(\mathbf{Y}'') \ \forall\, \mathbf{Y}'' \in \mathcal{Y}_{\text{incorrect}}\}|}{|\mathcal{Y}|} \tag{5.24}$$

or, equivalently,

$$\mathcal{CF}_{\text{mono}} = \mathcal{CF} - \mathcal{CF}_{\neg\text{mono}} \tag{5.25}$$

Example 5.9 For the C = 2-class task depicted in figure 5.3, the monotonic region of correct space is
the unshaded region in correct space (i.e., the unshaded region below and to the right of the discriminant
boundary). The fraction of discriminator output space that this region encompasses is $\mathcal{CF}_{\text{mono}}$: in this
particular case, $\mathcal{CF}_{\text{mono}} = \mathcal{CF} - \mathcal{CF}_{\neg\text{mono}} = 0.393$.

Definition 5.16 Non-monotonic incorrect fraction of discriminator output space $\mathcal{IF}_{\neg\text{mono}}$: We
denote the non-monotonic incorrect fraction of discriminator output space by $\mathcal{IF}_{\neg\text{mono}}$. It is the ratio of two
cardinalities: the numerator is the cardinality of the set of all discriminator output states in incorrect space
$\mathcal{Y}_{\text{incorrect}}$ for which there is at least one discriminator output state in correct space $\mathcal{Y}_{\text{correct}}$ that generates a
less optimal objective function value; the denominator is the cardinality of discriminator output space.

$$\mathcal{IF}_{\neg\text{mono}} \triangleq \frac{|\{\mathbf{Y} : \mathbf{Y} \in \mathcal{Y}_{\text{incorrect}} \ \cap \ \exists\, \mathbf{Y}''' \in \mathcal{Y}_{\text{correct}} \ \text{s.t.} \ \Phi(\mathbf{Y}) \ \text{is more optimal than} \ \Phi(\mathbf{Y}''')\}|}{|\mathcal{Y}|} \tag{5.26}$$

Example 5.10 For the C = 2-class task depicted in figure 5.3, the non-monotonic region of incorrect space
is the dark gray shaded region above and to the left of the discriminant boundary. The fraction of discriminator
output space that this region encompasses is $\mathcal{IF}_{\neg\text{mono}}$: in this particular case, $\mathcal{IF}_{\neg\text{mono}} \cong 0.285$.

Definition 5.17 Monotonic incorrect fraction of discriminator output space $\mathcal{IF}_{\text{mono}}$: We denote the
monotonic incorrect fraction of discriminator output space by $\mathcal{IF}_{\text{mono}}$. It is the ratio of two cardinalities:
the numerator is the cardinality of the set of all discriminator output states in incorrect space $\mathcal{Y}_{\text{incorrect}}$
for which all discriminator output states in correct space $\mathcal{Y}_{\text{correct}}$ generate more optimal objective function
values; the denominator is the cardinality of discriminator output space.

$$\mathcal{IF}_{\text{mono}} \triangleq \frac{|\{\mathbf{Y} : \mathbf{Y} \in \mathcal{Y}_{\text{incorrect}} \ \cap \ \Phi(\mathbf{Y}'''') \ \text{is more optimal than} \ \Phi(\mathbf{Y}) \ \forall\, \mathbf{Y}'''' \in \mathcal{Y}_{\text{correct}}\}|}{|\mathcal{Y}|} \tag{5.27}$$

or, equivalently,

$$\mathcal{IF}_{\text{mono}} = \mathcal{IF} - \mathcal{IF}_{\neg\text{mono}} \tag{5.28}$$

Example 5.11 For the $C = 2$-class task depicted in Figure 5.3, the monotonic region of incorrect space is the unshaded region in incorrect space (i.e., the unshaded region above and to the left of the discriminant boundary). The fraction of discriminator output space that this region encompasses is $\mathcal{IF}_{mono}$: in this particular case, $\mathcal{IF}_{mono} = \mathcal{IF} - \mathcal{IF}_{\neg mono} = 0.215$.


It should be clear from definitions 5.14-5.17 that the monotonic and non-monotonic fractions of discriminator output space sum to one as follows:

$$\mathcal{IF}_{\neg mono} + \mathcal{IF}_{mono} + \mathcal{CF}_{\neg mono} + \mathcal{CF}_{mono} = 1 \tag{5.29}$$


Definition 5.18 Monotonic fraction of discriminator output space $\mathcal{MF}$: We denote the monotonic fraction of discriminator output space by $\mathcal{MF}$. It is the sum of the monotonic correct and incorrect fractions:

$$\mathcal{MF} = \mathcal{IF}_{mono} + \mathcal{CF}_{mono} \quad \text{s.t.}\ \ 0 \le \mathcal{MF} \le 1 \tag{5.30}$$


Example 5.12 For the $C = 2$-class task depicted in Figure 5.3, the monotonic region of discriminator output space is the unshaded region (i.e., the combined unshaded regions on both sides of the discriminant boundary). The fraction of discriminator output space that this region encompasses is $\mathcal{MF}$: in this particular case, $\mathcal{MF} \cong 0.608$.
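The fractions quoted in examples 5.9 through 5.12 can be checked numerically. The following is only a minimal Monte Carlo sketch, not the report's own procedure: it assumes the $C = 2$ MSE form with a factor of one half, outputs uniformly sampled on $[0, 1]^2$, ties counted as misclassifications, and the two threshold values 1/4 and 1/2 (the smallest MSE in incorrect space and the least upper bound of MSE in correct space for $l = 0$, $h = 1$). The function and variable names are hypothetical.

```python
import random


def mse(y, h=1.0):
    # Assumed C = 2 MSE with y[0] the true-class output and l = 0:
    # MSE = 1/2 * [(h - y1)^2 + y2^2].
    return 0.5 * ((h - y[0]) ** 2 + y[1] ** 2)


def estimate_fractions(n_samples=200_000, seed=0):
    """Estimate the four fractions of output space by uniform sampling."""
    rng = random.Random(seed)
    min_incorrect = 0.25  # smallest MSE found in incorrect space (l = 0, h = 1)
    max_correct = 0.5     # least upper bound of MSE in correct space
    counts = {"CF_mono": 0, "CF_nonmono": 0, "IF_mono": 0, "IF_nonmono": 0}
    for _ in range(n_samples):
        y = (rng.random(), rng.random())
        value = mse(y)
        if y[0] > y[1]:
            # Correct state: monotonic if it beats every incorrect state.
            key = "CF_mono" if value < min_incorrect else "CF_nonmono"
        else:
            # Incorrect state: monotonic if every correct state beats it.
            key = "IF_mono" if value >= max_correct else "IF_nonmono"
        counts[key] += 1
    return {name: count / n_samples for name, count in counts.items()}


fractions = estimate_fractions()
monotonic_fraction = fractions["CF_mono"] + fractions["IF_mono"]
```

With a few hundred thousand samples the estimates should land close to the values 0.393, 0.285, 0.215, and 0.608 quoted in the examples, up to sampling noise.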

We measure an objective function's monotonicity, or lack thereof, by its monotonic fraction $\mathcal{MF}$, given discriminator output space $\mathcal{Y}$. If the objective function is monotonic, the monotonic fraction $\mathcal{MF}$ is unity; likewise, the monotonic fractions of incorrect ($\mathcal{IF}_{mono}$) and correct ($\mathcal{CF}_{mono}$) spaces are equal to $\mathcal{IF}$ and $\mathcal{CF}$, respectively. If the objective function is non-monotonic, $\mathcal{MF}$ is less than unity; likewise, $\mathcal{IF}_{mono}$ and/or $\mathcal{CF}_{mono}$ are less than $\mathcal{IF}$ and/or $\mathcal{CF}$, respectively. Simply put, objective functions with lower values of $\mathcal{MF}$, $\mathcal{IF}_{mono}$, and $\mathcal{CF}_{mono}$ are increasingly non-monotonic.

Intuitively, one can view the monotonic fraction $\mathcal{MF}$ as a kind of correlation coefficient between the act of optimizing the objective function and the act of minimizing the classifier’s empirical training sample error rate, absent any specific knowledge regarding whether or not the classifier’s hypothesis class is a proper parametric model of the feature vector. If $\mathcal{MF} = 1$, every discriminator output state in correct space generates a more optimal value of the objective function than any output state in incorrect space; as a result, optimizing the objective function is monotonically related to minimizing the training sample error
Figure 5.3: The MSE objective function is non-monotonic. Example 3 generates higher (less optimal) MSE than example 2, even though example 3 is correctly classified, whereas example 2 is not. For every discriminator output state in the light gray shaded region of correct space there is at least one output state in the dark gray shaded region of incorrect space with lower MSE. The figure depicts discriminator output space for a hypothetical $C = 2$-class task in which the discriminator’s outputs are bounded on $\mathcal{Y} = [l = 0, h = 1]^2$. Since the classifier has two discriminant functions, discriminator output space and reduced discriminator output space are one and the same.


rate, regardless of the choice of hypothesis class. If, on the other hand, $\mathcal{MF} < 1$, some discriminator output states in correct space generate less optimal values of the objective function than other output states in incorrect space; as a result, optimizing the objective function does not necessarily minimize the training sample error rate, a phenomenon we discuss further in section 5.3.5.

5.3.1  MAE  is  Non-Monotonic 

The mean absolute error (MAE) measure^ discussed in section 2.3.3 is the sum of the absolute difference between each discriminator output $y_i$ and its target value $\tau_i$:

^Recall  that  the  mean  absolute  error  measure  is  known  by  other  names  such  as  least  absolute  error  (LAE)  and  least  absolute  deviation 
(LAD,  eg..  19)). 
$$\mathrm{MAE} = \sum_{i=1}^{C} \varepsilon[y_i, \tau_i] \tag{5.31}$$

Given the training example/class label pair $\langle \mathbf{X}, \mathcal{W} \rangle$, the target value for the output $y_\tau$ (corresponding to the class label $\mathcal{W}$) is $D$, and the target values for all the other outputs are $\neg D$:

$$\varepsilon[y_i, \tau_i] = \begin{cases} h - y_i, & \omega_i = \omega_\tau \ \ \text{s.t.}\ \ \tau_i = D = h \\ y_i - l, & \omega_i \neq \omega_\tau \ \ \text{s.t.}\ \ \tau_i = \neg D = l \end{cases} \tag{5.32}$$

In the following arguments, we assume that $y_\tau$ is always $y_1$ in order to simplify notation. Under this condition, (5.31) and (5.32) reduce to

$$\mathrm{MAE} = h - y_1 + \sum_{j=2}^{C} (y_j - l) = h - y_1 - (C - 1) \cdot l + \sum_{j=2}^{C} y_j \tag{5.33}$$
The maximum MAE generated by a correctly classified example of $\omega_1$ occurs at the vertex of correct space farthest from $\mathbf{Y}_{correct}$. This is the output state in which $y_1 = h$ and all the other discriminator outputs are smaller by an infinitesimally small positive value $\epsilon$:

$$\max_{y_1} \mathrm{MAE}_{correct} = \lim_{\epsilon \to 0^+} \left[ h - y_1 + \sum_{j=2}^{C} \left| (y_1 - \epsilon) - l \right| \right]; \quad \epsilon > 0 \tag{5.34}$$

$$< (C - 1)(h - l) \tag{5.35}$$


Thus, (5.35) defines the “inner” boundary value of MAE in monotonic incorrect space. If $\mathbf{X}$ is an example of $\omega_1$, it is surely misclassified if $\mathrm{MAE} \ge (C - 1)(h - l)$: by (5.33) and (5.35), the example is surely misclassified if

$$y_1 \le \sum_{j=2}^{C} y_j - (C - 2) \cdot h \tag{5.36}$$

The minimum MAE generated by an incorrectly classified example of $\omega_1$ occurs when $y_\tau$ (i.e., $y_1$) is infinitesimally smaller than $\bar{y}_\tau$ and all the other discriminator outputs are minimal. By (5.33),




$$\min \mathrm{MAE}_{incorrect} = \lim_{\epsilon \to 0^+} \left[ h - y_1 + (y_1 + \epsilon) - l \right]; \quad \epsilon > 0$$

$$= h - l \tag{5.37}$$


Thus, the minimum value of MAE for an incorrectly classified example occurs when $y_\tau$ (i.e., $y_1$) equals $\bar{y}_\tau$, and (5.37) defines the “outer” boundary value of MAE in monotonic correct space. If $\mathbf{X}$ is an example of $\omega_1$, it is surely correctly classified if $\mathrm{MAE} < (h - l)$ such that

$$y_1 > \sum_{j=2}^{C} y_j - (C - 2) \cdot l \tag{5.38}$$


Note that when $C = 2$, (5.33) reduces to

$$\mathrm{MAE}_{C=2} = \underbrace{(h - l)}_{\text{constant}} - \,\delta_\tau \tag{5.39}$$


and, by (5.38), a training example is surely correctly classified if $\delta_\tau > 0$. $\mathrm{MAE}_{C=2}$ is therefore a strictly decreasing function of $\delta_\tau$. In this sense, the MAE objective function is a special (and so far as we know, unique) error measure: when the discriminator has two discriminant functions associated with a $C = 2$-class learning/classification task, the contours of constant MAE lie parallel to the discriminant boundary, as shown in figure 5.4. Since there is no discriminator output state in correct space that generates greater MAE than any discriminator output state in incorrect space, (5.16) is satisfied, and the monotonic fraction, which we denote by $\mathrm{MAE}\,\mathcal{MF}(C = 2)$, is unity.
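The two-class argument can be illustrated with a small grid check. This is only a sketch under the assumptions $l = 0$, $h = 1$, $y_1$ as the true-class output, and ties counted as misclassifications; the helper name `mae_two_class` is hypothetical. It verifies that MAE equals $(h - l) - \delta_\tau$ on the grid and that every correct grid state has lower MAE than every incorrect one.

```python
def mae_two_class(y1, y2, l=0.0, h=1.0):
    # Reduced MAE for C = 2 with y1 as the true-class output.
    return (h - y1) + (y2 - l)


grid = [i / 10 for i in range(11)]
max_gap = 0.0
correct_values, incorrect_values = [], []
for i, y1 in enumerate(grid):
    for j, y2 in enumerate(grid):
        value = mae_two_class(y1, y2)
        delta = y1 - y2  # delta_tau = y_tau - (largest other output)
        # Compare against the closed form (h - l) - delta with l = 0, h = 1.
        max_gap = max(max_gap, abs(value - (1.0 - delta)))
        # Ties (i == j) are treated as misclassifications here.
        (correct_values if i > j else incorrect_values).append(value)

# True when no correct state has MAE as large as any incorrect state.
separated = max(correct_values) < min(incorrect_values)
```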

However, the MAE objective function is, by (5.16), (5.35), and (5.37), non-monotonic for $C > 2$. Indeed, section 1.1 proves that the monotonic fractions of the MAE objective function decrease with increasing values of $C$ according to the following formulae, in which $\Gamma(\cdot)$ denotes the gamma function (e.g., [80, pp. A76-A77]):

$$\mathrm{MAE}\,\mathcal{IF}_{mono}(C) = \frac{1}{\Gamma(C + 1)} \tag{5.40}$$

$$\mathrm{MAE}\,\mathcal{CF}_{mono}(C) = \frac{1}{\Gamma(C + 1)} \tag{5.41}$$

$$\therefore\ \mathrm{MAE}\,\mathcal{MF}(C) = \frac{2}{\Gamma(C + 1)} \tag{5.42}$$


Figure 5.4: The MAE objective function can be monotonic if and only if the number of classes is $C = 2$; this is because MAE can be expressed as a strictly decreasing function of $\delta_\tau = y_\tau - \bar{y}_\tau$. As a result, the contours of constant MAE are parallel to the discriminant boundary, a necessary condition for monotonicity. The figure depicts discriminator output space for a hypothetical $C = 2$-class task in which the discriminator's outputs are bounded on $\mathcal{Y} = [l = 0, h = 1]^2$. Since the classifier has two discriminant functions, discriminator output space and reduced discriminator output space are one and the same.


Figure 5.5: The MAE objective function is increasingly non-monotonic as $C$ increases. Output $y_1$ is $y_\tau$, the output corresponding to the class label for the three training examples shown. Contours of constant MAE are projected onto the bounding faces of discriminator output space. White denotes the monotonic regions of discriminator output space, light gray denotes the non-monotonic region of correct space, and dark gray denotes the non-monotonic region of incorrect space.* Example 1 is correctly classified, yet it generates a higher (less optimal) value of MAE than examples 2 and 3, which are both incorrectly classified. Left: The discriminator output space from the perspective of $\mathbf{Y}_{correct}$. Right: The discriminator output space from the perspective of $\mathbf{Y}_{incorrect}$. The monotonic fractions of incorrect and correct spaces are each $\frac{1}{6}$, so the monotonic fraction of discriminator output space is $\frac{1}{3}$.

*The light gray shading underneath the cubic form of the discriminator output space is an imaginary shadow; it helps to clarify the cube's orientation.


Example 5.13 Figure 5.5 illustrates the non-monotonic nature of MAE when $C = 3$. The figure shows two views of discriminator output space (which is $\mathcal{Y} = [l = 0, h = 1]^3$ for the purpose of illustration). As in previous figures, white denotes the monotonic regions of discriminator output space, light gray denotes the non-monotonic region of correct space, and dark gray denotes the non-monotonic region of incorrect space. The left-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{correct} = (y_1 = 1, y_2 = 0, y_3 = 0)$; the right-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{incorrect} = (y_1 = 0, y_2 = 1, y_3 = 1)$. By (5.40)-(5.42), $\mathrm{MAE}\,\mathcal{IF}_{mono}(3) = \mathrm{MAE}\,\mathcal{CF}_{mono}(3) = \frac{1}{6}$, and $\mathrm{MAE}\,\mathcal{MF}(3) = \frac{1}{3}$. Example 1 (i.e., $\mathbf{X}^1$) generates a discriminator output state of $\mathcal{G}(\mathbf{X}^1 \,|\, \boldsymbol{\theta}) = \mathbf{Y}^1 = (1, .9, .9)$, so it is correctly classified; $\mathrm{MAE}(\mathbf{Y}^1) = 1.8$. Example 2 generates a discriminator output state of $\mathbf{Y}^2 = (.8, .2, 1)$, so it is incorrectly classified; $\mathrm{MAE}(\mathbf{Y}^2) = 1.4$. Example 3 generates a discriminator output state of $\mathbf{Y}^3 = (.4, 0, .6)$, so it is incorrectly classified; $\mathrm{MAE}(\mathbf{Y}^3) = 1.2$. Thus, we have the particularly undesirable situation in which the correctly classified example generates a less optimal value of MAE than the two incorrectly classified examples generate, a clear manifestation of the MAE objective function’s non-monotonic nature ($C \ge 3$).
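The three MAE values in example 5.13 can be reproduced with a few lines of code. This is a minimal sketch assuming $l = 0$, $h = 1$, the first output as the true-class output, and strict inequality for a correct classification; the helper names `mae` and `is_correct` are hypothetical.

```python
def mae(y, true_index=0, l=0.0, h=1.0):
    # Sum of absolute differences between each output and its target:
    # target h for the true-class output, target l for all the others.
    return sum(
        (h - yi) if i == true_index else (yi - l)
        for i, yi in enumerate(y)
    )


def is_correct(y, true_index=0):
    # Correct only if the true-class output strictly exceeds every other output.
    return all(y[true_index] > yi for i, yi in enumerate(y) if i != true_index)


examples = {1: (1.0, 0.9, 0.9), 2: (0.8, 0.2, 1.0), 3: (0.4, 0.0, 0.6)}
results = {k: (round(mae(y), 3), is_correct(y)) for k, y in examples.items()}
```

The correctly classified example 1 ends up with the largest MAE of the three, which is the non-monotonic behaviour the example describes.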

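By (5.40)-(5.42), all the MAE monotonic fractions go to zero as the number of classes $C$ grows large:

$$\begin{aligned} \lim_{C \to \infty} \mathrm{MAE}\,\mathcal{IF}_{mono}(C) &= 0 \\ \lim_{C \to \infty} \mathrm{MAE}\,\mathcal{CF}_{mono}(C) &= 0 \\ \lim_{C \to \infty} \mathrm{MAE}\,\mathcal{MF}(C) &= 0 \end{aligned} \tag{5.43}$$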

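Moreover, $\mathrm{MAE}\,\mathcal{MF}(C)$, $\mathrm{MAE}\,\mathcal{CF}_{mono}(C)$, and $\mathrm{MAE}\,\mathcal{IF}_{mono}(C)$ decrease super-exponentially as $C$ increases. For example, when the number of classes is ten, the monotonic fractions are quite small:

$$\begin{aligned} \mathrm{MAE}\,\mathcal{MF}(10) &= 5.51 \times 10^{-7} \\ \mathrm{MAE}\,\mathcal{IF}_{mono}(10) = \mathrm{MAE}\,\mathcal{CF}_{mono}(10) &= 2.75 \times 10^{-7} \end{aligned} \tag{5.44}$$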
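The formulae (5.40) through (5.42) are simple to evaluate directly. The sketch below uses Python's standard `math.gamma`; the function name `mae_monotonic_fractions` is a hypothetical helper introduced only for illustration.

```python
import math


def mae_monotonic_fractions(num_classes):
    # (5.40) and (5.41): each monotonic fraction is 1 / Gamma(C + 1);
    # (5.42): the monotonic fraction of output space is their sum.
    per_space = 1.0 / math.gamma(num_classes + 1)
    return {"IF_mono": per_space, "CF_mono": per_space, "MF": 2.0 * per_space}


two_class = mae_monotonic_fractions(2)
three_class = mae_monotonic_fractions(3)
ten_class = mae_monotonic_fractions(10)
```

For $C = 2$ the monotonic fraction evaluates to one, in agreement with the two-class discussion above, and for $C = 10$ the values agree with (5.44) to the precision quoted there.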


5.3.2 MSE is Non-Monotonic

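The mean squared error (MSE) measure discussed in section 2.3.2 is given by

$$\mathrm{MSE} = \frac{1}{2} \sum_{i=1}^{C} \varepsilon[y_i, \tau_i] \tag{5.45}$$

where

$$\varepsilon[y_i, \tau_i] = \begin{cases} (h - y_i)^2, & \omega_i = \omega_\tau \ \ \text{s.t.}\ \ \tau_i = D = h \\ (y_i - l)^2, & \omega_i \neq \omega_\tau \ \ \text{s.t.}\ \ \tau_i = \neg D = l \end{cases} \tag{5.46}$$

As in the preceding section, we assume that $y_\tau$ is always $y_1$ in order to simplify notation. Under this condition, (5.45) and (5.46) reduce to

$$\mathrm{MSE} = \frac{1}{2} \left[ (h - y_1)^2 + \sum_{j=2}^{C} (y_j - l)^2 \right] \tag{5.47}$$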
The maximum MSE generated by a correctly classified example of $\omega_1$ occurs at the vertex of correct space farthest from $\mathbf{Y}_{correct}$. This is the output state in which $y_1 = h$ and all the other discriminator outputs are smaller by an infinitesimally small positive value $\epsilon$:

$$\max_{y_1} \mathrm{MSE}_{correct} = \lim_{\epsilon \to 0^+} \frac{1}{2} \left[ (h - y_1)^2 + \sum_{j=2}^{C} \left( (y_1 - \epsilon) - l \right)^2 \right]; \quad \epsilon > 0 \tag{5.48}$$

$$< \frac{1}{2} (C - 1)(h - l)^2 \tag{5.49}$$


Thus, (5.49) defines the “inner” boundary value of MSE in monotonic incorrect space. If $\mathbf{X}$ is an example of $\omega_1$, it is surely misclassified if $\mathrm{MSE} \ge \frac{1}{2}(C - 1)(h - l)^2$: by (5.47) and (5.49), the example is surely misclassified if

$$y_1 \le h - \sqrt{(C - 1)(h - l)^2 - \sum_{j=2}^{C} (y_j - l)^2} \tag{5.50}$$


The minimum MSE generated by an incorrectly classified example of $\omega_1$ occurs when $y_\tau$ (i.e., $y_1$) is infinitesimally smaller than $\bar{y}_\tau$ and all the other discriminator outputs are minimal. By (5.47),

$$\min_{y_1} \mathrm{MSE}_{incorrect} = \lim_{\epsilon \to 0^+} \frac{1}{2} \left[ (h - y_1)^2 + (y_1 + \epsilon - l)^2 \right]; \quad \epsilon > 0$$

$$= \frac{1}{2} \left[ (h - y_1)^2 + (y_1 - l)^2 \right], \tag{5.51}$$


which is minimal when

$$\bar{y}_\tau = \underbrace{y_\tau}_{y_1} = \frac{h + l}{2} \tag{5.52}$$


Thus, (5.51) reduces to

$$\min \mathrm{MSE}_{incorrect} = \frac{1}{4} (h - l)^2 \tag{5.53}$$
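The minimization behind this value can be written out as a short worked derivation. It is a sketch that assumes the limiting form of the incorrect-space MSE with the factor of one half, namely one half of $(h - y_1)^2 + (y_1 - l)^2$; the name $f$ is introduced here only for illustration.

```latex
\begin{aligned}
f(y_1) &= \tfrac{1}{2}\left[(h - y_1)^2 + (y_1 - l)^2\right] \\
\frac{df}{dy_1} &= -(h - y_1) + (y_1 - l) = 2y_1 - (h + l) = 0
  \quad\Longrightarrow\quad y_1 = \frac{h + l}{2} \\
f\!\left(\frac{h + l}{2}\right)
  &= \tfrac{1}{2}\left[\left(\frac{h - l}{2}\right)^2 + \left(\frac{h - l}{2}\right)^2\right]
   = \tfrac{1}{4}(h - l)^2
\end{aligned}
```

Since the second derivative of $f$ is 2, which is positive, the stationary point is a minimum.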

Equation (5.53) defines the “outer” boundary value of MSE in monotonic correct space. If $\mathbf{X}$ is an example of $\omega_1$, it is surely correctly classified if $\mathrm{MSE} < \frac{1}{4}(h - l)^2$ such that

$$y_1 > h - \sqrt{\frac{1}{2}(h - l)^2 - \sum_{j=2}^{C} (y_j - l)^2} \tag{5.54}$$


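The MSE objective function is, by (5.16), (5.49), and (5.53), non-monotonic for $C \ge 2$. Indeed, section 1.2 proves that the monotonic fractions of the MSE objective function decrease with increasing values of $C$ according to the following formulae:

$$\mathrm{MSE}\,\mathcal{IF}_{mono}(C) < \frac{1}{\Gamma(C + 1)} \tag{5.55}$$

$$\mathrm{MSE}\,\mathcal{CF}_{mono}(C) = \frac{\left(\frac{\pi}{8}\right)^{C/2}}{\Gamma\!\left(\frac{C}{2} + 1\right)} \tag{5.56}$$

$$\therefore\ \mathrm{MSE}\,\mathcal{MF}(C) < \frac{\left(\frac{\pi}{8}\right)^{C/2}}{\Gamma\!\left(\frac{C}{2} + 1\right)} + \frac{1}{\Gamma(C + 1)} \tag{5.57}$$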

Note  that  (5.57)  is  loosely  bounded  from  above  by  for  both  small  and  large  C . 

Example 5.14 Figure 5.3 illustrates the non-monotonic nature of MSE when $C = 2$ ($\mathcal{Y} = [l = 0, h = 1]^2$ for the purpose of illustration). By (5.56), $\mathrm{MSE}\,\mathcal{CF}_{mono}(2) = .393$. The fraction $\mathrm{MSE}\,\mathcal{IF}_{\neg mono}(2)$ can be computed exactly, obviating the need to use the less precise bound of (5.55): it is simply one fourth the area of a circle of unit radius, minus one half (i.e., $\mathrm{MSE}\,\mathcal{IF}_{\neg mono}(2) = .285$). Thus, by (5.30), $\mathrm{MSE}\,\mathcal{MF}(2) = .608$. Example 1 generates a discriminator output state of $\mathbf{Y}^1 = (1, 0)$, so it is correctly classified; $\mathrm{MSE}(\mathbf{Y}^1) = 0$. Example 2 generates a discriminator output state of $\mathbf{Y}^2 = (.45, .55)$, so it is incorrectly classified; $\mathrm{MSE}(\mathbf{Y}^2) = .30$. Example 3 generates a discriminator output state of $\mathbf{Y}^3 = (.93, .85)$, so it is correctly classified; $\mathrm{MSE}(\mathbf{Y}^3) = .36$. The MSE generated by example 1 is minimal. However, the incorrectly classified example 2 generates a more optimal value of MSE than the correctly classified example 3 generates...


Example 5.15 Figure 5.6 illustrates the non-monotonic nature of MSE when $C = 3$. The figure shows two views of discriminator output space (which is $\mathcal{Y} = [l = 0, h = 1]^3$ for the purpose of illustration). The left-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{correct}$; the right-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{incorrect}$. By (5.55)-(5.57), $\mathrm{MSE}\,\mathcal{IF}_{mono}(3) < \frac{1}{6}$, $\mathrm{MSE}\,\mathcal{CF}_{mono}(3) = .185$, and $\mathrm{MSE}\,\mathcal{MF}(3) < .352$. Example 1 generates a discriminator output state of $\mathbf{Y}^1 = (1, .922, .922)$, so it is correctly classified; $\mathrm{MSE}(\mathbf{Y}^1) = .85$. Example 2 generates a discriminator
Figure 5.6: The MSE objective function is increasingly non-monotonic as $C$ increases. Output $y_1$ is $y_\tau$, the output corresponding to the class label for the three training examples shown. Contours of constant MSE are projected onto the bounding faces of discriminator output space. White denotes the monotonic regions of discriminator output space, light gray denotes the non-monotonic region of correct space, and dark gray denotes the non-monotonic region of incorrect space.* Example 1 is correctly classified, yet it generates a higher (less optimal) value of MSE than examples 2 and 3, which are both incorrectly classified. Left: The discriminator output space from the perspective of $\mathbf{Y}_{correct}$. Right: The discriminator output space from the perspective of $\mathbf{Y}_{incorrect}$. The monotonic fraction of incorrect space is less than $\frac{1}{6}$; the monotonic fraction of correct space is .185; the monotonic fraction of discriminator output space is therefore less than .352.

*The light gray shading underneath the cubic form of the discriminator output space is an imaginary shadow; it helps to clarify the cube's orientation.


output state of $\mathbf{Y}^2 = (.508, .284, 1)$, so it is incorrectly classified; $\mathrm{MSE}(\mathbf{Y}^2) = .66$. Example 3 generates a discriminator output state of $\mathbf{Y}^3 = (.399, 0, .601)$, so it is incorrectly classified; $\mathrm{MSE}(\mathbf{Y}^3) = .36$. Thus, the correctly classified example generates a less optimal value of MSE than the two incorrectly classified examples generate; like MAE, the MSE objective function is non-monotonic ($C \ge 2$).
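The MSE values quoted in examples 5.14 and 5.15 can be reproduced as follows. This is a minimal sketch assuming the MSE form with a factor of one half, $l = 0$, $h = 1$, and the first output as the true-class output; the helper name `mse` and the dictionary names are hypothetical.

```python
def mse(y, true_index=0, l=0.0, h=1.0):
    # Half the sum of squared differences between each output and its target:
    # target h for the true-class output, target l for all the others.
    total = sum(
        (h - yi) ** 2 if i == true_index else (yi - l) ** 2
        for i, yi in enumerate(y)
    )
    return 0.5 * total


two_class_states = {1: (1.0, 0.0), 2: (0.45, 0.55), 3: (0.93, 0.85)}
three_class_states = {1: (1.0, 0.922, 0.922), 2: (0.508, 0.284, 1.0), 3: (0.399, 0.0, 0.601)}

two_class_mse = {k: round(mse(y), 2) for k, y in two_class_states.items()}
three_class_mse = {k: round(mse(y), 2) for k, y in three_class_states.items()}
```

In both cases a correctly classified example ends up with a larger MSE than at least one misclassified example, as the examples describe.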

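By (5.55)-(5.57), all the MSE monotonic fractions go to zero as the number of classes $C$ grows large:

$$\begin{aligned} \lim_{C \to \infty} \mathrm{MSE}\,\mathcal{IF}_{mono}(C) &= 0 \\ \lim_{C \to \infty} \mathrm{MSE}\,\mathcal{CF}_{mono}(C) &= 0 \\ \lim_{C \to \infty} \mathrm{MSE}\,\mathcal{MF}(C) &= 0 \end{aligned} \tag{5.58}$$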


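Like their MAE counterparts, $\mathrm{MSE}\,\mathcal{MF}(C)$, $\mathrm{MSE}\,\mathcal{CF}_{mono}(C)$, and $\mathrm{MSE}\,\mathcal{IF}_{mono}(C)$ decrease super-exponentially as $C$ increases, although $\mathrm{MSE}\,\mathcal{MF}(C)$ is $O\!\left[\left[\Gamma\!\left(\frac{C}{2} + 1\right)\right]^{-1}\right]$ whereas $\mathrm{MAE}\,\mathcal{MF}(C)$ is $O\!\left[\left[\Gamma(C + 1)\right]^{-1}\right]$. Nevertheless, when the number of classes is ten, the monotonic fractions are still quite small:

$$\begin{aligned} \mathrm{MSE}\,\mathcal{IF}_{mono}(10) &< 2.75 \times 10^{-7} \\ \mathrm{MSE}\,\mathcal{CF}_{mono}(10) &= 7.78 \times 10^{-5} \\ \mathrm{MSE}\,\mathcal{MF}(10) &< 7.81 \times 10^{-5} \end{aligned} \tag{5.59}$$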
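The numbers quoted for the MSE monotonic fractions can be checked against closed forms. The sketch below assumes that the monotonic correct fraction is $(\pi/8)^{C/2} / \Gamma(C/2 + 1)$ and that the monotonic incorrect fraction is bounded above by $1 / \Gamma(C + 1)$; these forms reproduce the values .393, .185, .352, and those of (5.59). The function names are hypothetical.

```python
import math


def mse_cf_mono(num_classes):
    # Assumed closed form for the monotonic correct fraction of MSE.
    half = num_classes / 2.0
    return (math.pi / 8.0) ** half / math.gamma(half + 1.0)


def mse_if_mono_bound(num_classes):
    # Assumed upper bound on the monotonic incorrect fraction of MSE.
    return 1.0 / math.gamma(num_classes + 1)


def mse_mf_bound(num_classes):
    # Upper bound on the monotonic fraction: sum of the two terms above.
    return mse_cf_mono(num_classes) + mse_if_mono_bound(num_classes)


cf_values = {c: mse_cf_mono(c) for c in (2, 3, 10)}
mf_bounds = {c: mse_mf_bound(c) for c in (3, 10)}
```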

5.3.3  The  Kullback-Leibler  Information  Distance  is  Non-Monotonic 

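The Kullback-Leibler information distance (a.k.a. cross-entropy, CE) discussed in section 2.3.2 is given by

$$\mathrm{CE} = \sum_{i=1}^{C} \varepsilon[y_i, \tau_i] \tag{5.60}$$

where

$$\varepsilon[y_i, \tau_i] = \begin{cases} -\log(y_i - l), & \omega_i = \omega_\tau \ \ \text{s.t.}\ \ \tau_i = D = h \\ -\log(h - y_i), & \omega_i \neq \omega_\tau \ \ \text{s.t.}\ \ \tau_i = \neg D = l \end{cases} \tag{5.61}$$

As in the preceding two sections, we assume that $y_\tau$ is always $y_1$ in order to simplify notation. Under this condition, (5.60) and (5.61) reduce to

$$\mathrm{CE} = -\log(y_1 - l) - \sum_{j=2}^{C} \log(h - y_j) \tag{5.62}$$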

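The maximum CE generated by a correctly classified example of $\omega_1$ occurs at the vertex of correct space farthest from $\mathbf{Y}_{correct}$. This is the output state in which $y_1 = h$ and all the other discriminator outputs are smaller by an infinitesimally small positive value $\epsilon$:

$$\max_{y_1} \mathrm{CE}_{correct} = \lim_{\epsilon \to 0^+} \left[ -\log(y_1 - l) - \sum_{j=2}^{C} \log\left( h - (y_1 - \epsilon) \right) \right]; \quad \epsilon > 0 \tag{5.63}$$

$$= \infty \tag{5.64}$$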
Thus, the “inner” boundary value of CE in monotonic incorrect space is infinite. That is, an example is never surely misclassified, since a correctly classified example can generate an infinite value of CE. As a result, the monotonic fraction of incorrect space is zero for all $C \ge 2$:

$$\mathrm{CE}\,\mathcal{IF}_{mono}(C) = 0 \quad \forall\, C \ge 2 \tag{5.65}$$

The minimum CE generated by an incorrectly classified example of $\omega_1$ occurs when $y_\tau$ (i.e., $y_1$) is infinitesimally smaller than $\bar{y}_\tau$ and all the other discriminator outputs are minimal. By (5.62),

$$\min_{y_1} \mathrm{CE}_{incorrect} = \lim_{\epsilon \to 0^+} \left[ -\log(y_1 - l) - \log\left( h - (y_1 + \epsilon) \right) - (C - 2) \cdot \log(h - l) \right]; \quad \epsilon > 0$$

$$= -\log(y_1 - l) - \log(h - y_1) - (C - 2) \cdot \log(h - l), \tag{5.66}$$

which is minimal when

$$\bar{y}_\tau = \underbrace{y_\tau}_{y_1} = \frac{h + l}{2} \tag{5.67}$$


Thus, (5.66) reduces to

$$\min \mathrm{CE}_{incorrect} = -C \cdot \log(h - l) + \log(4) \tag{5.68}$$


Equation (5.68) defines the “outer” boundary value of CE in monotonic correct space. If $\mathbf{X}$ is an example of $\omega_1$, it is surely correctly classified if $\mathrm{CE} < -C \cdot \log(h - l) + \log(4)$. Assuming a logarithmic basis of $b$ (i.e., the notation $\log$ actually denotes $\log_b$), an example is surely correctly classified when

$$y_1 > l + b^{\,C \log_b(h - l) \,-\, \log_b(4) \,-\, \sum_{j=2}^{C} \log_b(h - y_j)} \tag{5.69}$$
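The surely-correct condition can be derived in a few steps. The following worked sketch assumes the reduced form $\mathrm{CE} = -\log_b(y_1 - l) - \sum_{j=2}^{C} \log_b(h - y_j)$ and the boundary value $-C \cdot \log_b(h - l) + \log_b(4)$ stated above; the product form on the last line is an equivalent restatement introduced here for illustration.

```latex
\begin{aligned}
-\log_b(y_1 - l) - \sum_{j=2}^{C} \log_b(h - y_j) &< -C \log_b(h - l) + \log_b(4) \\
\log_b(y_1 - l) &> C \log_b(h - l) - \log_b(4) - \sum_{j=2}^{C} \log_b(h - y_j) \\
y_1 - l &> \frac{(h - l)^C}{4 \prod_{j=2}^{C} (h - y_j)}
\end{aligned}
```

For $l = 0$ and $h = 1$ the condition becomes $y_1 \prod_{j=2}^{C} (1 - y_j) > \frac{1}{4}$, which does not depend on the logarithmic basis $b$.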

The CE objective function is, by (5.16), (5.64), and (5.68), non-monotonic for $C \ge 2$. Indeed, $\mathrm{CE}\,\mathcal{IF}_{mono}(C) = 0$ for all $C \ge 2$, by (5.65), and section 1.3 proves that the two other monotonic fractions of the CE objective function decrease with increasing values of $C$ according to the following formulae when $\mathcal{Y} = [l = 0, h = 1]^C$:

$$\mathrm{CE}\,\mathcal{MF}(C) = \mathrm{CE}\,\mathcal{CF}_{mono}(C) = 1 - A \cdot \underbrace{\sum_{j=0}^{C-1} \frac{1}{j!} \cdot \left[ -\ln(A) \right]^j}_{\zeta(C)}; \quad \mathcal{Y} = [l = 0, h = 1]^C,\ \ A = \tfrac{1}{4} \tag{5.70}$$
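It is straightforward to prove the following relationship

$$\lim_{C \to \infty} \zeta(C) = \exp\left[ -\ln(A) \right] = \frac{1}{A} \tag{5.71}$$

so all the CE monotonic fractions go to zero as the number of classes $C$ grows large:

$$\begin{aligned} \lim_{C \to \infty} \mathrm{CE}\,\mathcal{CF}_{mono}(C) &= 0 \\ \mathrm{CE}\,\mathcal{IF}_{mono}(C) &= 0 \quad \forall\, C \ge 2 \\ \lim_{C \to \infty} \mathrm{CE}\,\mathcal{MF}(C) &= 0 \end{aligned} \tag{5.72}$$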

Expressions for $\mathrm{CE}\,\mathcal{MF}(C)$ and $\mathrm{CE}\,\mathcal{CF}_{mono}(C)$ are considerably more cumbersome for the more general discriminator output space $\mathcal{Y} = [l, h]^C$ (see section 1.3). Although the specific values of $\mathrm{CE}\,\mathcal{MF}(C)$ and $\mathrm{CE}\,\mathcal{CF}_{mono}(C)$ change with $l$ and $h$, the general dependence on $C$ is well-described by (5.70) as long as $l$ and $h$ are finite, a constraint that is consistent with (2.60). For these reasons, we omit the more general expressions.

Example 5.16 Figure 5.7 illustrates the non-monotonic nature of CE when $C = 2$ ($\mathcal{Y} = [l = 0, h = 1]^2$ for the purpose of illustration). The logarithmic basis of (5.61) is 10 (i.e., $\log(\cdot)$ denotes $\log_{10}(\cdot)$ in (5.61)). By (5.70), the monotonic fractions are $\mathrm{CE}\,\mathcal{IF}_{mono}(2) = 0$ and $\mathrm{CE}\,\mathcal{MF}(2) = \mathrm{CE}\,\mathcal{CF}_{mono}(2) = .403$. Example 1 generates a discriminator output state of $\mathbf{Y}^1 = (1, 0)$, so it is correctly classified; $\mathrm{CE}(\mathbf{Y}^1) = 0$. Example 2 generates a discriminator output state of $\mathbf{Y}^2 = (.45, .55)$, so it is incorrectly classified; $\mathrm{CE}(\mathbf{Y}^2) = .69$. Example 3 generates a discriminator output state of $\mathbf{Y}^3 = (.93, .85)$, so it is correctly classified; $\mathrm{CE}(\mathbf{Y}^3) = .86$. The CE generated by example 1 is minimal. However, the incorrectly classified example 2 generates a more optimal value of CE than the correctly classified example 3 generates...

Example 5.17 Figure 5.8 illustrates the non-monotonic nature of CE when $C = 3$. The figure shows two views of discriminator output space (which is $\mathcal{Y} = [l = 0, h = 1]^3$ for the purpose of illustration). Again, $\log(\cdot)$ denotes $\log_{10}(\cdot)$ in (5.61). The left-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{correct}$; the right-hand figure shows discriminator output space from the perspective of $\mathbf{Y}_{incorrect}$. By (5.70), $\mathrm{CE}\,\mathcal{IF}_{mono}(3) = 0$ and $\mathrm{CE}\,\mathcal{MF}(3) = \mathrm{CE}\,\mathcal{CF}_{mono}(3) = .163$. Example 1 generates a discriminator output state of $\mathbf{Y}^1 = (1, .776, .776)$, so it is correctly classified; $\mathrm{CE}(\mathbf{Y}^1) = 1.3$. Example 2 generates a discriminator output state of $\mathbf{Y}^2 = (.387, 0, .613)$, so it is incorrectly classified; $\mathrm{CE}(\mathbf{Y}^2) = .83$. Thus, the correctly classified example 1 generates a less optimal value of CE than the incorrectly classified example 2 generates; like MAE and MSE, the CE objective function is non-monotonic ($C \ge 2$).
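The CE values in examples 5.16 and 5.17 can be recomputed with a short script. This is a sketch under the assumptions that the per-output error is $-\log_{10}(y_i - l)$ for the true-class output and $-\log_{10}(h - y_i)$ for the others, with $l = 0$, $h = 1$, and the first output as the true-class output; the helper name `cross_entropy` is hypothetical.

```python
import math


def cross_entropy(y, true_index=0, l=0.0, h=1.0, base=10.0):
    # Assumed per-output error: -log(y_i - l) for the true-class output,
    # -log(h - y_i) for every other output.
    total = 0.0
    for i, yi in enumerate(y):
        argument = (yi - l) if i == true_index else (h - yi)
        total -= math.log(argument, base)
    return total


ce_two_class = {
    1: cross_entropy((1.0, 0.0)),
    2: cross_entropy((0.45, 0.55)),
    3: cross_entropy((0.93, 0.85)),
}
ce_three_class = {
    1: cross_entropy((1.0, 0.776, 0.776)),
    2: cross_entropy((0.387, 0.0, 0.613)),
}
```

Under these assumptions the computed values agree with the quoted ones to within about 0.01, and in both examples a correctly classified state has a larger CE than a misclassified one.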

Given (5.71), (5.70) can also be expressed as
Figure 5.7: The Kullback-Leibler information distance (CE objective function) is non-monotonic. Example 3 generates higher (less optimal) CE than example 2, even though example 3 is correctly classified, whereas example 2 is not. For every point in the light gray shaded region of correct space there is at least one point in the dark gray shaded region of incorrect space with lower CE. The figure depicts discriminator output space for a hypothetical $C = 2$-class task in which the discriminator’s outputs are bounded on $\mathcal{Y} = [l = 0, h = 1]^2$. Since the classifier has two discriminant functions, discriminator output space and reduced discriminator output space are one and the same.


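$$\mathrm{CE}\,\mathcal{MF}(C) = \mathrm{CE}\,\mathcal{CF}_{mono}(C) = A \cdot \sum_{j=C}^{\infty} \frac{1}{j!} \cdot \left[ -\ln(A) \right]^j \ \le\ \frac{\left[ -\ln(A) \right]^C}{\Gamma(C + 1)}; \quad \mathcal{Y} = [l = 0, h = 1]^C,\ \ A = \tfrac{1}{4} \tag{5.73}$$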


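The upper bound of (5.73) is tight for small $C$ and loose for large $C$. Thus, the CE objective function is like its MAE and MSE counterparts: $\mathrm{CE}\,\mathcal{MF}(C)$ and $\mathrm{CE}\,\mathcal{CF}_{mono}(C)$ decrease super-exponentially as $C$ increases, although $\mathrm{CE}\,\mathcal{MF}(C)$ is $O\!\left[ \left[ -\ln(A) \right]^C \cdot \left[ \Gamma(C + 1) \right]^{-1} \right]$ whereas $\mathrm{MAE}\,\mathcal{MF}(C)$ is $O\!\left[ \left[ \Gamma(C + 1) \right]^{-1} \right]$ and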

Figure 5.8: The CE objective function is increasingly non-monotonic as $C$ increases. Output $y_1$ is $y_\tau$, the output corresponding to the class label for the two training examples shown. Contours of constant CE are projected onto the bounding faces of discriminator output space. White denotes the monotonic regions of discriminator output space, light gray denotes the non-monotonic region of correct space, and dark gray denotes the non-monotonic region of incorrect space.* Example 1 is correctly classified, yet it generates a higher (less optimal) value of CE than example 2, which is incorrectly classified. Left: The discriminator output space from the perspective of $\mathbf{Y}_{correct}$. Right: The discriminator output space from the perspective of $\mathbf{Y}_{incorrect}$. The monotonic fraction of incorrect space is zero; the monotonic fraction of correct space is .163; the monotonic fraction of discriminator output space is therefore .163.

*The light gray shading underneath the cubic form of the discriminator output space is an imaginary shadow; it helps to clarify the cube’s orientation.


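$\mathrm{MSE}\,\mathcal{MF}(C)$ is $O\!\left[ \left[ \Gamma\!\left( \frac{C}{2} + 1 \right) \right]^{-1} \right]$. When the number of classes is ten, the monotonic fractions are quite small:

$$\begin{aligned} \mathrm{CE}\,\mathcal{MF}(10) = \mathrm{CE}\,\mathcal{CF}_{mono}(10) &= 2.06 \times 10^{-6} \\ \mathrm{CE}\,\mathcal{IF}_{mono}(10) &= 0; \end{aligned} \qquad \mathcal{Y} = [l = 0, h = 1]^{10} \tag{5.74}$$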
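The CE monotonic fractions quoted in this section can be evaluated numerically. The sketch below assumes the unit output space with $A = 1/4$, the finite-sum form $1 - A \sum_{j=0}^{C-1} [-\ln(A)]^j / j!$, the equivalent tail-sum form $A \sum_{j=C}^{\infty} [-\ln(A)]^j / j!$, and the upper bound $[-\ln(A)]^C / \Gamma(C + 1)$. The function names and the truncation length `terms` are hypothetical choices made for illustration.

```python
import math


def ce_cf_mono(num_classes, a=0.25):
    # Finite-sum form: 1 - A * sum_{j=0}^{C-1} (-ln A)^j / j!
    x = -math.log(a)
    zeta = sum(x ** j / math.factorial(j) for j in range(num_classes))
    return 1.0 - a * zeta


def ce_cf_mono_tail(num_classes, a=0.25, terms=60):
    # Tail-sum form, truncated after `terms` terms:
    # A * sum_{j=C}^{infinity} (-ln A)^j / j!
    x = -math.log(a)
    return a * sum(
        x ** j / math.factorial(j)
        for j in range(num_classes, num_classes + terms)
    )


def ce_upper_bound(num_classes, a=0.25):
    # Upper bound on the tail-sum form: (-ln A)^C / Gamma(C + 1)
    return (-math.log(a)) ** num_classes / math.gamma(num_classes + 1)


ce_fractions = {c: ce_cf_mono(c) for c in (2, 3, 10)}
```

Under these assumptions the two forms agree numerically and reproduce .403, .163, and the ten-class value quoted above.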


5.3.4 The General Error Measure is Non-Monotonic

The findings of the preceding three sections are directly linked with the section 3.4 proofs that error measures engender inefficient learning. Specifically, (3.40) proves that minimizing the general error measure does not minimize the classifier's error rate. Thus, (3.40) proves that error measures fail to satisfy (5.16) and definition 5.10: they are non-monotonic.
The section 3.6 exception to the rule that probabilistic learning is inefficient does not contradict
our  assertion  herein  that  error  measures  are  non-monotonic.  When  the  hypothesis  class  is  a  proper 
parametric  model  of  the  feature  vector,  probabilistic  learning  via  the  error  measure  that  generates  maximum- 
likelihood  estimates  of  the  discriminator's  parameters  is  efficient.  Under  these  circumstances,  the  functional 
characteristics  of  the  discriminator's  hypothesis  class  compensate  for  the  non-monotonic  nature  of  the  error 
measure  (see  below);  the  error  measure  itself  remains  non-monotonic,  independent  of  the  choice  of  hypothesis 
class. 

5.3.5  The  Link  Between  Objective  Function  Monotonicity  and  Learning  Efficiency 

The monotonic objective function engenders asymptotically efficient learning, regardless of the choice of hypothesis class (section 3.3). The non-monotonic objective function engenders inefficient learning if the hypothesis class is improper (section 3.4); if the hypothesis class is proper, there may be an error measure that induces efficient learning, despite its non-monotonic nature (section 3.6). That is, under proper conditions the non-monotonic error measure induces efficient learning because the discriminator’s proper nature compensates for the small value of $\mathcal{MF}$. In some cases (see below) the “proper” discriminator’s correct output states are constrained to lie in the monotonic region of correct space. It remains an open question whether this is always the case when the hypothesis class constitutes a proper parametric model of the feature vector.

Under improper conditions the discriminator’s functional properties fail to constrain the discriminator’s correct output states to lie in the monotonic region of correct space, and the small value of $\mathcal{MF}$ results in inefficient learning. Indeed, as $\mathcal{MF} \to 0$, the objective function’s value during learning may tell us absolutely nothing about the classifier’s empirical training sample error rate if the hypothesis class is not the proper parametric model of the feature vector and the error measure is not the one associated with the maximum-likelihood probabilistic learning procedure described in hypothesis 3.1 (page 77).

Chapter  4  clearly  illustrates  the  three  fundamental  scenarios  we  address  in  chapter  3  and  this  section: 

A non-monotonic objective function paired with a proper parametric model can induce efficient learning: The discriminant functions of the partially parametric proper model in section 4.2 are given by (4.4). These equations guarantee that the discriminator output state lies on the discriminant continuum (definition 5.1), regardless of the discriminator’s parameterization. This constraint compensates for the non-monotonic nature of the error measures used for probabilistic learning because the discriminator simply cannot produce output states that lie in the non-monotonic region of correct space. As a result, the hypothesis class’s functional properties in (4.4) ensure that (5.16) is never violated. This ensures that probabilistic learning via the general error measure is asymptotically efficient. Moreover, if the error measure associated with the maximum-likelihood learning procedure defined in hypothesis 3.1 exists, it will induce efficient learning for small sample sizes as well. This is precisely the scenario we find in section 4.2.
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A  non-monotonic  objective  function  paired  with  an  improper  parametric  model  always  induces 
inefficient  learning  —  The  discriminant  functions  of  the  improper  parametric  model  in  section  4.3  are 
given  by  (4.32).  These  equations  do  not  guarantee  that  the  discriminator’s  correct  output  states  will  always  lie 
in  the  monotonic  region  of  correct  space.  As  a  result,  the  hypothesis  class's  functional  properties  fail  to  ensure 
that  (S.  1 6)  is  never  violated.  This  ensures  that  probabilistic  learning  via  the  general  error  measure  is  inefficient. 
Indeed,  by  (5.57),  MSEI.r„„„„(3)  <  | ,  MSEC’.;"„,«,«{3)  =  0.185,  and  MSE.V(.F(3)  <  0.352,  and  the 
MSE-generated minimum-complexity polynomial classifier of section 4.3 exhibits an 18% discriminant bias
for both small and large training sample sizes.

A monotonic objective function paired with any hypothesis class always induces asymptotically efficient
learning: Since the CFM objective function is monotonic (see section 5.3.6 below), definition 5.10 and
(5.16) are always satisfied, regardless of the hypothesis class's functional properties. As a result, differential
learning is always asymptotically efficient, a fact that is demonstrated in the experiments of chapter 4 and
part II.


5.3.6  CFM  is  Monotonic 

As we stated earlier, if the objective function is to be monotonic, (5.16) must hold. The CFM objective
function always satisfies (5.16), since it is a function of only two discriminator outputs ($y_\tau$ and $y_{\bar{\tau}}$)
regardless of the number of classes $C$. Thus, the CFM generated by a correctly classified example is always
greater than the CFM generated by an incorrectly classified example (e.g., see figure 5.2). This statement is
true for any and all choices of the CFM confidence parameter $\psi$. As a result,

C¥MIT„^[C)  =  IT  =  ^  VC  >  2 

CFMCT„^(C)  =  CT  =  ^  VC>2  (5.75) 

CFMMTiC)  =  1  VC  >  2 

So far we have focused on whether or not the objective function's contours of constant value are parallel to
the discriminant boundary. Although this condition, expressed mathematically in (5.16), is a necessary
one for monotonicity, it is not sufficient to satisfy definition 5.10. The reason for this lies in the difference
between an objective function's being monotonic for a single example, versus its being monotonic for all
examples in the training sample. This latter kind of monotonicity is true monotonicity.

CFM is Truly Monotonic

In the case of CFM, (2.102) and (2.104) must also be satisfied in order for CFM to be truly monotonic, thereby
inducing asymptotically efficient learning. We clarify this notion of true monotonicity with a hypothetical
example, showing that CFM is non-monotonic when (2.104) is violated, but monotonic when (2.104) is
satisfied. A similar illustration can be made regarding the constraint of (2.102).

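Consider a two-class pattern recognition task for which the feature is a random scalar $x$. The class prior
probabilities are

$$P_{\mathcal{W}}(\omega_1) = P_{\mathcal{W}}(\omega_2) = \tfrac{1}{2} \tag{5.76}$$

and the class-conditional pdfs of $x$ are given by

$$\begin{aligned}
\rho_{x|\mathcal{W}}(x \mid \omega_1) &= \delta(x + 1) \\
\rho_{x|\mathcal{W}}(x \mid \omega_2) &= .1 \cdot \delta(x + .9) + .9 \cdot \delta(x - 1)
\end{aligned} \tag{5.77}$$

where $\delta(\cdot)$ denotes the Dirac delta function (e.g., [80, pg. 266]). The a posteriori class probabilities are
therefore

$$\begin{aligned}
P_{\mathcal{W}|x}(\omega_1 \mid x) &= \delta(x + 1) \\
P_{\mathcal{W}|x}(\omega_2 \mid x) &= \delta(x + .9) + \delta(x - 1)
\end{aligned} \tag{5.78}$$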

Figure 5.9 depicts the class-conditional pdf - class prior probability products as well as the a posteriori
class probabilities of $x$. The light blue arrows depict the Dirac delta functions associated with class $\omega_1$,
and the red arrows depict those associated with class $\omega_2$. The colored circles in the a posteriori class
probability plots indicate the values of $x$ for which $P_{\mathcal{W}|x}(\omega_1 \mid x)$ (light blue) and $P_{\mathcal{W}|x}(\omega_2 \mid x)$ (red) are
zero. Obviously, the two classes that $x$ can represent are linearly separable; a linear classifier need only form
a boundary on the open interval $(-1, -.9)$, which constitutes the Bayes-optimal class boundary

$$-1 < B_{1,2\,\mathrm{Bayes}} < -.9 \tag{5.79}$$

This open interval is depicted by the swath of gray shading in figure 5.9.
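The a posteriori probabilities of this example follow from Bayes' rule applied to the point masses of (5.76) and (5.77). The following is a minimal sketch of that arithmetic; the names `PRIORS`, `CLASS_CONDITIONAL`, and `posterior` are illustrative assumptions, and each Dirac delta is represented simply by its weight at its support point.

```python
# Priors of (5.76) and point-mass weights of the class-conditional pdfs in (5.77).
# Each Dirac delta is stored as {support point: weight}; names are illustrative.
PRIORS = {"w1": 0.5, "w2": 0.5}
CLASS_CONDITIONAL = {
    "w1": {-1.0: 1.0},
    "w2": {-0.9: 0.1, 1.0: 0.9},
}


def posterior(x):
    """Return {class: P(class | x)} via Bayes' rule for a support point x."""
    joint = {c: PRIORS[c] * CLASS_CONDITIONAL[c].get(x, 0.0) for c in PRIORS}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("x is not a support point of the feature")
    return {c: joint[c] / total for c in joint}


for x in (-1.0, -0.9, 1.0):
    print(x, posterior(x))
```

Every support point belongs to exactly one class with probability one, which is why any boundary placed strictly between $x = -1$ and $x = -.9$ is Bayes-optimal.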

We employ a minimum-complexity linear classifier⁵ with the discriminator $\mathcal{G}(x|\theta) =
\{g_1(x|\theta), g_2(x|\theta)\}$: the discriminant functions of $\mathcal{G}(x|\theta)$ are given by

$$\begin{aligned}
g_1(x|\theta) &= -\tfrac{1}{2}x + \theta_0 \\
g_2(x|\theta) &= -g_1(x|\theta) = \tfrac{1}{2}x - \theta_0
\end{aligned} \tag{5.80}$$

so the discriminator's parameter vector reduces to the single parameter $\theta_0$. By convention, we sometimes
use the more general vector notation $\theta$, assuming that the reader understands the equivalence between $\theta$
and $\theta_0$ in this particular example. We denote the discriminator parameterization that maximizes CFM by $\theta_0^*$.
Given the discriminator in (5.80) and the parameter space $\Theta = \mathbb{R}$, discriminator output space is $\mathcal{Y} = \mathbb{R}^2$.


⁵The complexity measure is the classifier's number of parameters, which is one.

Learning is simply a search for $\theta_0^*$ over $\Theta$ (i.e., the domain of $\theta_0$), given a specific value of the
CFM confidence parameter $\psi$. Figure 5.10 illustrates this search space: CFM is plotted as a function of
the discriminator's single parameter $\theta_0$ and the CFM confidence parameter $\psi$. Specifically, the plot shows
the CFM generated by an asymptotically large training sample, given the discriminator $\mathcal{G}(x|\theta)$ and $\psi$:
$\lim_{n \to \infty} \mathrm{CFM}(\mathcal{S}^n \mid \theta)$. The figure's color coding emphasizes the value of $\psi$: magenta indicates $\psi = 1$
and blue indicates $\psi \approx 0$; intermediate values of $\psi$ correspond to a natural progression from magenta
to blue via the intermediate color yellow. The black contours on the figure denote the value of CFM as a
function of $\psi$ for fixed increments of $\theta_0$.

Given $\mathcal{G}(x|\theta)$ in (5.80), the class boundary it forms is

$$B_{1,2} = 2\theta_0 \tag{5.81}$$

It should be clear from the probabilistic nature of $x$ illustrated in figure 5.9 that the classifier with the
discriminator described by (5.80) will correctly classify all examples of $x$ if

$$g_1(x|\theta) = g_2(x|\theta) = 0 \quad \text{for some } x \in (-1, -.9) \tag{5.82}$$

When (5.82) is satisfied, the differentially generated class boundary is Bayes-optimal:

$$-.5 < \theta_0^* < -.45 \;\Longrightarrow\; B_{1,2\,\mathrm{CFM}} = B_{1,2\,\mathrm{Bayes}} \in (-1, -.9) \tag{5.83}$$
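The link between the parameter $\theta_0$ and the classifier's error rate can be checked numerically. The sketch below assumes the discriminant function takes the form $g_1(x|\theta) = \theta_0 - x/2$, which is consistent with the class boundary $2\theta_0$ of (5.81); the names `JOINT`, `classify`, and `error_rate` are illustrative, and ties are arbitrarily assigned to class 2.

```python
# Joint probability mass of the example as (x, true class, P(x, class)),
# obtained from the priors of (5.76) and the pdfs of (5.77).
JOINT = [(-1.0, 1, 0.5), (-0.9, 2, 0.05), (1.0, 2, 0.45)]


def g1(x, theta0):
    # Assumed form of the first discriminant function; it is zero at x = 2*theta0.
    return theta0 - 0.5 * x


def classify(x, theta0):
    """Pick class 1 when g1 > g2 = -g1, otherwise class 2 (ties go to class 2)."""
    return 1 if g1(x, theta0) > 0.0 else 2


def error_rate(theta0):
    """Total probability mass of the support points that are misclassified."""
    return sum(mass for x, label, mass in JOINT if classify(x, theta0) != label)


for theta0 in (-0.6, -0.475, -0.0125):
    print(theta0, error_rate(theta0))
```

Any $\theta_0$ strictly inside $(-.5, -.45)$ yields zero error, while a boundary to the right of $x = -.9$ misclassifies the 5% of examples found there.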


The key question is, for what values of $\psi$ is CFM monotonic: that is, what values of $\psi$ guarantee
that maximizing $\lim_{n \to \infty} \mathrm{CFM}(\mathcal{S}^n \mid \theta)$ minimizes the classifier's error rate $P_e(\mathcal{G} \mid \theta)$ by generating the
parameterization $\theta_0^*$ of (5.83)? The answer: those values of $\psi$ for which (2.102) and (2.104) are satisfied.⁶
Maximizing CFM subject to the constraints of (2.102) and (2.104) minimizes the classifier's empirical
training sample error rate and, by the proof of section 3.3, induces asymptotically efficient learning. When
(2.102) and/or (2.104) are not satisfied, maximizing CFM is not guaranteed to minimize the classifier's
empirical training sample error rate: CFM is non-monotonic and may induce inefficient learning.


Example 5.18 The largest positive discriminant differentials that $\mathcal{G}(x|\theta)$ in (5.80) can generate for
examples of $x = -1$ and $x = -.9$ are

⁶Section 7.4 describes a practical approach to differential learning via a scheduled reduction of the CFM confidence parameter; this
satisfies the upper bound constraints on $\psi$ imposed by (2.102) and (2.104).

Figure 5.9: The simple two-class scalar feature discrimination task for which CFM is monotonic if and only if
$\psi \lesssim .11$. From top to bottom: the class-conditional density - class prior products $\rho_{x|\mathcal{W}}(x \mid \omega_1) \cdot P_{\mathcal{W}}(\omega_1)$
and $\rho_{x|\mathcal{W}}(x \mid \omega_2) \cdot P_{\mathcal{W}}(\omega_2)$; $P_{\mathcal{W}|x}(\omega_1 \mid x)$, the a posteriori probability of class $\omega_1$; $P_{\mathcal{W}|x}(\omega_2 \mid x)$, the
a posteriori probability of class $\omega_2$. Two sets of linear discriminant functions are shown superimposed on
the a posteriori class probabilities of $x$: the magenta-colored functions are generated by maximizing CFM,
given the confidence parameter $\psi = 1$; the blue-colored functions are generated by maximizing CFM,
given the confidence parameter $\psi = .05$. Note that the classifier generated with $\psi = 1$ incorrectly classifies
all examples of $x = -.9$, so CFM is non-monotonic for $\psi = 1$, given $x$ described by (5.76)-(5.78)
and the discriminator $\mathcal{G}(x|\theta)$ described by (5.80). However, when $\psi$ is reduced to a value of .05, the
classifier generated by CFM correctly classifies all examples of $x$. Indeed, CFM is always monotonic, given
a sufficiently small value of $\psi$, a relationship formalized by the constraints of (2.102) and (2.104).

Figure 5.10: The CFM (i.e., $\lim_{n \to \infty} \mathrm{CFM}(\mathcal{S}^n \mid \theta)$) generated by an asymptotically large training sample
of the two-class random feature $x$ described by (5.76)-(5.78): CFM is plotted as a function of the
single discriminator parameter $\theta_0$ and the CFM confidence parameter $\psi$. Given $x$ and the simple linear
discriminator $\mathcal{G}(x|\theta)$ described by (5.80), the plot shows that CFM is monotonic if and only if $\psi$ is small
(boxed region). That is, the parameterization $\theta_0^*$ that maximizes CFM also minimizes the classifier's training
sample error rate if and only if $\psi$ is sufficiently small.


Figure 5.11: Details of the maximum CFM (i.e., $\max_\theta \lim_{n \to \infty} \mathrm{CFM}(\mathcal{S}^n \mid \theta)$ for small $\psi$) generated by
an asymptotically large training sample of the two-class random feature $x$. The figure shows the boxed
region of figure 5.10 as it would appear viewed along an axis that is parallel to the $\psi$ axis (i.e., the image
plane of this figure is parallel to the CFM-$\theta_0$ plane in figure 5.10). CFM is plotted as a function of
the single discriminator parameter $\theta_0$: different contours represent different CFM functions of $\theta_0$, given
different values of the confidence parameter $\psi$. The color coding of the contours denotes the value of $\psi$
and corresponds to the scheme used in figure 5.10. The contours are in increments of .01, from .31 (green)
to .01 (blue); the lower-bound contour of the figure is $\psi = .002$. The CFM objective function is, by
(2.104), guaranteed to be monotonic if $\psi \le .05$, an upper bound denoted by the white highlighted
contour of the figure. Note that CFM is maximized, given this value of $\psi$, for $-.478 \lesssim \theta_0 \lesssim -.467$. Thus,
the differentially generated classifier ($\psi \le .05$) yields Bayesian discrimination, by (5.83), since CFM is
maximal for $\theta_0$ values that fall within the gray-shaded open interval $(-.5, -.45)$.

$$\begin{aligned}
\delta_1(x = -1 \mid \theta_0 = -.475) &= \underbrace{g_1(x = -1 \mid \theta_0 = -.475)}_{.025} - \underbrace{g_2(x = -1 \mid \theta_0 = -.475)}_{-.025} = .05 \\
\delta_2(x = -.9 \mid \theta_0 = -.475) &= \underbrace{g_2(x = -.9 \mid \theta_0 = -.475)}_{.025} - \underbrace{g_1(x = -.9 \mid \theta_0 = -.475)}_{-.025} = .05
\end{aligned} \tag{5.84}$$

By (2.104), $\psi$ must therefore be no greater than .05 in order to guarantee that CFM is truly monotonic. If
we maximize $\lim_{n \to \infty} \mathrm{CFM}(\mathcal{S}^n \mid \theta)$ using the synthetic CFM objective function described in appendix D
with $\psi = 1$, $\theta_0^* = -.0125$. One can see that CFM peaks at this value of $\theta_0$ in figure 5.10: the back
edge of the CFM function's magenta region corresponds to $\psi = 1$, and it peaks at $\theta_0 = -.0125$. This
parameterization generates the magenta discriminant functions of figure 5.9, which form the class boundary
$B_{1,2\,\mathrm{CFM}} = -.025 \neq B_{1,2\,\mathrm{Bayes}}$. Thus, CFM is non-monotonic for $\psi = 1$, given $x$ and $\mathcal{G}(x|\theta)$.
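The discriminant differentials quoted for this example can be reproduced directly. The sketch below again assumes $g_1(x|\theta) = \theta_0 - x/2$ and $g_2 = -g_1$, a form consistent with (5.81) and with the values in (5.84); the function names are illustrative.

```python
def discriminants(x, theta0):
    """Return (g1, g2) for the assumed single-parameter linear discriminator."""
    g1 = theta0 - 0.5 * x
    return g1, -g1


def differential(x, label, theta0):
    """Two-class discriminant differential: g_label minus the other output."""
    g1, g2 = discriminants(x, theta0)
    return g1 - g2 if label == 1 else g2 - g1


# Parameterization inside the Bayes-optimal interval: both differentials are small.
print(differential(-1.0, 1, -0.475), differential(-0.9, 2, -0.475))

# Parameterization reported for psi = 1: the example at x = -.9 is misclassified.
print(differential(-0.9, 2, -0.0125))
```

With $\theta_0 = -.475$ the two differentials are equal, which is the most that a boundary centered in $(-1, -.9)$ can offer to both examples at once; with $\theta_0 = -.0125$ the differential at $x = -.9$ is negative.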

If we reduce $\psi$ to a small value, CFM becomes monotonic. Figures 5.10 and 5.11 illustrate this
transformation. Again, the color-coding of figure 5.10 reflects the value of $\psi$: large values of $\psi$ correspond
to the magenta part of the image, intermediate values of $\psi$ correspond to the yellow part of the image, and
small values of $\psi$ correspond to the blue part of the image. CFM takes on a maximum value of $\approx .95$ for
all but small values of $\psi$, reflecting the linear classifier's inability to learn that all examples of $x = -.9$
are examples of class $\omega_2$. As $\psi$ becomes small (i.e., $\lesssim .11$), CFM takes on a maximum value of $\approx 1$
for values of $\theta_0$ corresponding to the boxed portion of the figure. Figure 5.11 depicts this region in detail.
CFM is plotted as a function of the single discriminator parameter $\theta_0$, given small values of $\psi$. Different
contours in the figure depict different CFM functions associated with different values of $\psi$ from .31 (green)
to .01 (blue) in increments of .01; the bounding (i.e., outer-most dark blue) contour is for $\psi = .002$. One
can see from the figure that CFM is non-monotonic for $\psi > .11$ because it peaks at $\theta_0 > -.45$; therefore,
$B_{1,2\,\mathrm{CFM}} \neq B_{1,2\,\mathrm{Bayes}}$. However, for $\psi \le .05$ (the contour associated with $\psi = .05$ is highlighted in
white), CFM is monotonic because it peaks between $-.478 \lesssim \theta_0 \lesssim -.467$; therefore, $B_{1,2\,\mathrm{CFM}} = B_{1,2\,\mathrm{Bayes}}$.
Given the parameterization $\theta_0^* = -.475$ for $\psi = .05$, the blue discriminant functions of figure 5.9 result;
they form the class boundary $B_{1,2\,\mathrm{CFM}} = -.95 = B_{1,2\,\mathrm{Bayes}}$. Thus, CFM is truly monotonic for $\psi \le .05$,
given $x$ and $\mathcal{G}(x|\theta)$.
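The qualitative effect of the confidence parameter can be sketched numerically. The code below is only an illustration under stated assumptions: it substitutes a logistic sigmoid $\sigma(\delta/\psi)$ for the synthetic CFM of appendix D, and it assumes $g_1(x|\theta) = \theta_0 - x/2$. The thresholds on $\psi$ at which this stand-in becomes monotonic therefore need not match the values of .11 and .05 read from figures 5.10 and 5.11; all names are hypothetical.

```python
import math

# Joint probability mass of the example as (x, true class, P(x, class)).
JOINT = [(-1.0, 1, 0.5), (-0.9, 2, 0.05), (1.0, 2, 0.45)]


def sigmoid(t):
    # Numerically stable logistic function (a stand-in, not the synthetic CFM).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def differential(x, label, theta0):
    d1 = 2.0 * theta0 - x  # delta_1 = g1 - g2 with g1 = theta0 - x/2
    return d1 if label == 1 else -d1


def expected_cfm(theta0, psi):
    """Asymptotic (expected) value of the stand-in objective for one theta0."""
    return sum(m * sigmoid(differential(x, c, theta0) / psi) for x, c, m in JOINT)


def best_theta0(psi, step=0.001):
    """Grid search for the maximizing parameter over [-1, 1]."""
    grid = [-1.0 + i * step for i in range(2001)]
    return max(grid, key=lambda t: expected_cfm(t, psi))


for psi in (1.0, 0.01):
    print(psi, best_theta0(psi))
```

For the large confidence value the maximizer falls outside $(-.5, -.45)$, so the resulting classifier misclassifies $x = -.9$; for the small value the maximizer falls inside the interval, mirroring the behavior described for figures 5.10 and 5.11.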

MAE is not truly monotonic for any C: It is interesting to note that minimizing the MAE of $\mathcal{G}(x|\theta)$
(i.e., minimizing $\lim_{n \to \infty} \mathrm{MAE}(\mathcal{S}^n \mid \theta)$ for $D = 1$ and $\neg D = 0$) generates the "optimal" parameterization
$-.5 < \theta_0^* < .5$ (we omit the details of the analysis in the interest of brevity). All parameterizations in this
interval exhibit the same (minimum) value of MAE = .5, but the classifier's error rate is minimized only if
$-.5 < \theta_0^* < -.45$. Thus, despite its satisfying (5.16) for $C = 2$, the MAE objective function is not truly
monotonic for the arbitrary combination of feature vector and hypothesis class.


5.4  Training  Example  Types 

There is a taxonomy of training examples that follows naturally from the preceding illustrative example and
the more general differential learning scenario. Figure 5.12 illustrates three categories of training examples:
un-learned, learned, and transition examples (see definitions D.1-D.3). Each type is characterized by the
value of its associated discriminant differential⁷ $\delta$. Un-learned training examples are misclassified; as a
result, they exhibit negative discriminant differentials. Learned examples are ones that generate the maximum
value of CFM, so they must have positive discriminant differentials. The minimum value of $\delta$ for which
an example is "learned," which is marked in figure D.1, depends on the value of the confidence
parameter $\psi$. As $\psi \to 0^+$ and the synthetic CFM sigmoid grows steep, this minimum value of $\delta$ decreases
to zero. Transition examples exhibit discriminant differentials that correspond to the transition region of the
synthetic CFM sigmoid (i.e., $\delta$ lies between the two bounds shown in figure D.1). Simply put,
un-learned examples exhibit negative discriminant differentials, learned examples exhibit relatively large
positive discriminant differentials, and transition examples exhibit small (positive or negative) discriminant
differentials. By definitions D.1-D.3, some un-learned examples are also transition examples.
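The taxonomy can be expressed as a small labeling function. This is a sketch only: the bounds `lower` and `upper` are hypothetical stand-ins for the limits of the transition region drawn in figure D.1 (assumed here to satisfy `lower < 0 < upper`), and the exact conditions of definitions D.1-D.3 may differ in detail.

```python
def example_types(delta, lower, upper):
    """Return the set of category labels for one training example.

    delta        -- the example's discriminant differential
    lower, upper -- hypothetical bounds of the transition region (lower < 0 < upper)
    """
    types = set()
    if delta < 0.0:
        types.add("un-learned")   # misclassified example
    if lower < delta < upper:
        types.add("transition")   # small positive or negative differential
    if delta >= upper:
        types.add("learned")      # objective function is at its maximum
    return types


for delta in (-1.0, -0.01, 0.02, 0.5):
    print(delta, sorted(example_types(delta, lower=-0.05, upper=0.05)))
```

Returning a set rather than a single label reflects the remark that some un-learned examples are also transition examples.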


The confidence parameter: Recall that $P_{\mathcal{W}|\mathbf{x}}(\omega_* \mid \mathbf{X})$ denotes the largest a posteriori class probability
for $\mathbf{X}$, $\omega_*$ denotes the Bayes-optimal (i.e., most likely) class, and $\delta_*(\mathbf{X} \mid \theta^*)$ denotes the associated
discriminant differential $g_*(\mathbf{X} \mid \theta^*) - \max_{j \neq *} g_j(\mathbf{X} \mid \theta^*)$. The notation $\theta^*$ indicates that the classifier's
parameterization is the one generated by maximizing CFM. Throughout the present discussion, we assume
that the discriminator possesses sufficient functional complexity to learn the Bayes-optimal classifier of $\mathbf{X}$.
As a result, we assume that $\delta_*(\mathbf{X} \mid \theta^*)$ is positive, as long as $\psi$ is sufficiently small.

The relationship between small values of $P_{\mathcal{W}|\mathbf{x}}(\omega_* \mid \mathbf{X})$ and/or $\delta_*(\mathbf{X} \mid \theta^*)$ and small values of $\psi$,
codified by (2.102) and (2.104), accounts for our use of the term "confidence parameter" for $\psi$. If $P_{\mathcal{W}|\mathbf{x}}(\omega_* \mid \mathbf{X})$
and/or $\delta_*(\mathbf{X} \mid \theta^*)$ are small for a given $\mathbf{X} \in \mathcal{X}$, $\psi$ must be small: we should literally have low confidence
that the classifier will learn $\omega_*$, the Bayes-optimal class label of $\mathbf{X}$. Our necessary lack of confidence reflects
the small a posteriori probability of class $\omega_*$ and/or the functional limitations of the hypothesis class at $\mathbf{X}$.


This  notion  of  confidence  is  related  to  the  difference  between  learning  easy  examples  and  learning  hard 
⁷Again, absent a subscript, the notation $\delta$ implies the discriminant differential $\delta_\tau$ of (2.22) and (5.14).

Figure 5.12: Three types of training examples: un-learned examples exhibit negative discriminant differentials;
transition examples exhibit discriminant differentials that correspond to the transition region of the synthetic
CFM sigmoid (therefore, some un-learned examples are also transition examples); learned examples have
positive differentials that correspond to the maximum CFM value of unity.


examples. 

Definition 5.19 The easy training example: Probabilistically, the easy example $\mathbf{X}^j$ of class $\mathcal{W}^j = \omega_i$
is found far from the Bayes-optimal class boundaries on $\mathcal{X}$, near a mode of its class-conditional pdf (i.e.,
$\rho_{\mathbf{x}|\mathcal{W}}(\mathbf{X}^j \mid \omega_i) \approx \max_{\mathbf{X}} \rho_{\mathbf{x}|\mathcal{W}}(\mathbf{X} \mid \omega_i)$). Assuming nominally equal prior probabilities for all classes, the
easy example's a posteriori class probability $P_{\mathcal{W}|\mathbf{x}}(\omega_* \mid \mathbf{X}^j)$ and the associated discriminant differential
$\delta_*(\mathbf{X}^j \mid \theta^*)$ are therefore large, allowing learning to be accomplished with high confidence (i.e.,
$\psi \approx 1$) by (2.102) and (2.104).

Example 5.19 All examples of $x = 1$ and $x = -1$, where $x$ is described by (5.76)-(5.78),
are easy. Note in figure 5.9 that these examples have a posteriori class probabilities of one, and $\mathcal{G}(x|\theta)$,
described in (5.80), generates relatively large positive discriminant differentials for them when CFM is
maximized for $\psi = 1$:

$$\begin{aligned}
\delta_1(x = -1 \mid \theta_0^* = -.0125) &= .975 \\
\delta_2(x = 1 \mid \theta_0^* = -.0125) &= 1.025
\end{aligned} \qquad ;\ \psi = 1 \tag{5.85}$$

Recall that the magenta discriminant functions of figure 5.9 are those that maximize CFM for $\psi = 1$.

Definition 5.20 The hard training example: Probabilistically, the hard example $\mathbf{X}^j$ of class
$\mathcal{W}^j = \omega_i$ is found in the vicinity of the class boundaries on $\mathcal{X}$, in a "tail" of its class-conditional pdf (i.e.,
$\rho_{\mathbf{x}|\mathcal{W}}(\mathbf{X}^j \mid \omega_i) \ll \max_{\mathbf{X}} \rho_{\mathbf{x}|\mathcal{W}}(\mathbf{X} \mid \omega_i)$). In these tails, $P_{\mathcal{W}|\mathbf{x}}(\omega_* \mid \mathbf{X}^j)$ and/or $\delta_*(\mathbf{X}^j \mid \theta^*)$ are relatively
small, so the hard example must, by (2.102) and (2.104), be learned with low confidence (i.e.,
$\psi \to 0^+$) if it can be learned at all.⁸

Example 5.20 All examples of $x = -.9$, where $x$ is described by (5.76)-(5.78), are hard. Note
in figure 5.9 that these examples have a large a posteriori class probability of $P_{\mathcal{W}|x}(\omega_2 \mid x = -.9) = 1$,
but their class-conditional pdf is small ($\rho_{x|\mathcal{W}}(x = -.9 \mid \omega_2) = .1 \ll \rho_{x|\mathcal{W}}(x = 1 \mid \omega_2) = .9$).
The discriminator $\mathcal{G}(x \mid \theta^*)$, described in (5.80), at best generates relatively small positive discriminant
differentials (i.e., $\le .05$) for these examples, and does so only when CFM is maximized for $\psi \le .05$ (recall
(5.84) and note that the blue discriminant functions of figure 5.9 are those that maximize CFM for $\psi = .05$).

5.5  The  Convergence  Properties  of  Differential  Learning  via  CFM 

Differential learning via synthetic CFM, by design, ignores learned examples: these examples have no
effect on learning because the synthetic CFM objective function is maximum and all its derivatives are
zero for learned examples.⁹ Only unlearned and transition examples generate non-zero first and higher
order derivatives of the synthetic objective function (see (D.7)-(D.9) in section D.1). From the preceding
section and chapter 2, we know that CFM must approach a modified Heaviside step function (i.e., $\psi$
must approach a value of zero) if hard examples are to be learned. Intuitively, then, this limiting form of
CFM simply counts correct classifications. Since the Heaviside step is non-differentiable, there is no way
to use it with differentiable supervised classifiers. Instead, we employ CFM, which remains differentiable
and approximates the non-differentiable counting objective function with high precision as its confidence
parameter $\psi$ goes to zero.

The functional properties of CFM raise the issue of the learning rate: that is, the rate at which
the search for the classifier's CFM-maximizing parameterization takes place. Given our definition of the
differentiable supervised classifier and the means by which it learns, learning speed depends on the first order
(and possibly higher order, depending on the specific search algorithm employed) derivative of the CFM
objective function.

Formally, we are searching for a parameterization $\theta^*$ by which CFM is maximized, given the training
sample $\mathcal{S}^n$:

$$\mathrm{CFM}(\mathcal{S}^n \mid \theta^*) = \max_{\theta} \mathrm{CFM}(\mathcal{S}^n \mid \theta) \tag{5.86}$$

⁸For stochastic feature vectors with overlapping class-conditional pdfs, the Bayes error rate is non-zero: some example/class label
pairs are inevitably un-learnable.

⁹This characteristic results in a substantial computational savings when differential learning is actually implemented on the computer
(see section 7.5).

Knowing the value of CFM, given the parameterization $\theta[k]$ (where $k$ is simply an iteration index), allows
us to approximate its value for the parameterization $\theta[k+1]$ via the Taylor series

$$\begin{aligned}
\mathrm{CFM}(\mathcal{S}^n \mid \theta[k+1]) ={} & \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) + (\theta[k+1] - \theta[k])^T \, \nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) \\
& + \tfrac{1}{2} \, (\theta[k+1] - \theta[k])^T \, \mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) (\theta[k+1] - \theta[k]) \\
& + \ldots
\end{aligned} \tag{5.87}$$

where $z^T$ denotes the transpose of vector $z$, $\nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right)$ denotes the gradient of CFM with
respect to the parameter vector $\theta$, evaluated at $\theta[k]$, and $\mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right)$ denotes the hessian of
CFM with respect to the parameter vector $\theta$, evaluated at $\theta[k]$.

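As a first step towards iteratively finding the parameterization $\theta^*$ that maximizes CFM, we take the
derivative of (5.87), set it equal to zero, and solve for $\theta[k+1]$:

$$\begin{aligned}
\nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k+1]) \right) ={} & \nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) + (\theta[k+1] - \theta[k])^T \, \mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) \\
& + \text{higher-order terms} \\
={} & 0
\end{aligned} \tag{5.88}$$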

Dropping the higher order terms and rearranging the low-order terms yields the familiar quadratic
approximation upon which all first and second order iterative search (i.e., optimization) algorithms are based (see,
for example, [106, ch. 10]):

$$\theta[k+1] \cong \theta[k] + \nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) \left[ -\mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) \right]^{-1} \tag{5.89}$$

Second-order search algorithms compute (or approximate) the inverse hessian; first-order algorithms assume
that the hessian is diagonal

$$\left[ -\mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right) \right]^{-1} = \epsilon \, \mathbf{I}, \tag{5.90}$$

such that (5.88) reduces to¹⁰

$$\theta[k+1] \cong \theta[k] + \underbrace{\epsilon \, \nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta[k]) \right)}_{\Delta\theta[k]} \tag{5.91}$$

¹⁰We use a modified form of the backpropagation algorithm to implement differential learning via synthetic CFM. Please see
section D.5 for important additional details regarding the implementation.
The step size scaling factor $\epsilon$ has a positive sign, since CFM is being maximized.¹¹
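The difference between the second-order update of (5.89) and the first-order update of (5.91) can be seen on a toy problem. The sketch below does not use CFM at all: it assumes a concave quadratic objective with a known maximizer `c` and a symmetric hessian, chosen purely so that both updates can be verified exactly. The names `Q`, `c`, `second_order_step`, and `first_order_step` are illustrative.

```python
import numpy as np

# Toy concave objective f(theta) = -0.5 * (theta - c)^T Q (theta - c).
Q = np.array([[2.0, 0.5], [0.5, 1.0]])   # symmetric positive definite
c = np.array([1.0, -2.0])                # maximizer of the toy objective


def gradient(theta):
    return -Q @ (theta - c)


def hessian(theta):
    return -Q


def second_order_step(theta):
    # Update in the spirit of (5.89): add [-H]^{-1} times the gradient.
    # The linear system is solved directly instead of forming the inverse.
    return theta + np.linalg.solve(-hessian(theta), gradient(theta))


def first_order_step(theta, eps):
    # Update of (5.91): the inverse negated hessian is replaced by eps * I.
    return theta + eps * gradient(theta)


theta_newton = second_order_step(np.zeros(2))

theta_first = np.zeros(2)
for _ in range(100):
    theta_first = first_order_step(theta_first, eps=0.5)

print(theta_newton, theta_first)
```

On a quadratic objective the second-order step lands on the maximizer in one iteration, whereas the first-order iteration approaches it geometrically at a rate set by the positive step size and the curvature.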

The parameterization $\theta^*$ is taken to be $\theta[k]$ after $k$ iterations, where $k$ is large. We are concerned
with the learning rate: how large does $k$ have to be to ensure that $\theta[k]$ is a good approximation to $\theta^*$,
assuming that $\mathrm{CFM}(\mathcal{S}^n \mid \theta)$ is a unimodal function on $\Theta$? Given (5.89) and (5.91), the answer to this
question depends on the search step size $\Delta\theta[k]$, which in turn depends on the gradient (and, in the case of
(5.89), the hessian) of $\mathrm{CFM}(\mathcal{S}^n \mid \theta[k])$. The gradient and hessian, in turn, depend on the training sample
$\mathcal{S}^n$, the functional properties of the discriminator $\mathcal{G}(\mathbf{X} \mid \theta[k]) = \{g_1(\mathbf{X} \mid \theta[k]), \ldots, g_C(\mathbf{X} \mid \theta[k])\}$, and the
functional properties of the CFM objective function itself. Dropping the iteration index $k$ for notational
simplicity, we recall from (2.81) that the sample CFM is

$$\mathrm{CFM}(\mathcal{S}^n \mid \theta) = \frac{1}{n} \sum_{j=1}^{n} \left( \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \, : \, \mathcal{W}^j = \omega_\tau \right) \tag{5.92}$$
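The sample average in (5.92) can be sketched as follows. As an assumption, a logistic function of $\delta/\psi$ stands in for the synthetic CFM sigmoid of appendix D, and the discriminant differential is computed as the correct class's output minus the largest other output; the function names and the tiny three-class sample are hypothetical.

```python
import math


def sigmoid(t):
    # Numerically stable logistic function (a stand-in, not the synthetic CFM).
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discriminant_differential(outputs, label):
    """Output of the correct class minus the largest output of any other class."""
    rival = max(y for j, y in enumerate(outputs) if j != label)
    return outputs[label] - rival


def sample_cfm(all_outputs, labels, psi):
    """Average of the sigmoid of each example's differential, as in (5.92)."""
    total = sum(sigmoid(discriminant_differential(y, t) / psi)
                for y, t in zip(all_outputs, labels))
    return total / len(labels)


# Hypothetical discriminator outputs for three examples of a three-class task.
outputs = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.3, 0.4, 0.3]]
labels = [0, 1, 0]
print(sample_cfm(outputs, labels, psi=1.0), sample_cfm(outputs, labels, psi=0.01))
```

With a small confidence parameter the value approaches the fraction of correctly classified examples (two of three here), which is the counting behavior attributed to the limiting form of CFM.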

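Thus, the $i$th element of $\nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta) \right)$ is given by

$$\frac{\partial}{\partial \theta_i} \mathrm{CFM}(\mathcal{S}^n \mid \theta) = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\partial}{\partial \theta_i} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \, : \, \mathcal{W}^j = \omega_\tau \right) \tag{5.93}$$

Likewise, the $i,l$th element of $\mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta) \right)$ is given by

$$\frac{\partial^2}{\partial \theta_i \, \partial \theta_l} \mathrm{CFM}(\mathcal{S}^n \mid \theta) = \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\partial^2}{\partial \theta_i \, \partial \theta_l} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \, : \, \mathcal{W}^j = \omega_\tau \right) \tag{5.94}$$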

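The gradient $\nabla_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta) \right)$ and hessian $\mathbf{H}_\theta \left( \mathrm{CFM}(\mathcal{S}^n \mid \theta) \right)$ ultimately depend on the partial
derivatives $\frac{\partial}{\partial \theta_i} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right]$ and $\frac{\partial^2}{\partial \theta_i \, \partial \theta_l} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right]$, which are given by

$$\frac{\partial}{\partial \theta_i} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] = \frac{\partial}{\partial \delta_\tau} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \cdot \frac{\partial}{\partial \theta_i} \delta_\tau(\mathbf{X}^j \mid \theta) \tag{5.95}$$

and

$$\begin{aligned}
\frac{\partial^2}{\partial \theta_i \, \partial \theta_l} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] ={} & \frac{\partial^2}{\partial \delta_\tau^2} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \cdot \frac{\partial}{\partial \theta_i} \delta_\tau(\mathbf{X}^j \mid \theta) \cdot \frac{\partial}{\partial \theta_l} \delta_\tau(\mathbf{X}^j \mid \theta) \\
& + \frac{\partial}{\partial \delta_\tau} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta), \psi \right] \cdot \frac{\partial^2}{\partial \theta_i \, \partial \theta_l} \delta_\tau(\mathbf{X}^j \mid \theta)
\end{aligned} \tag{5.96}$$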

¹¹When the objective function is an error measure to be minimized, $\epsilon$ has a negative sign, and (5.90) is a familiar part of the gradient
descent equation used in the backpropagation algorithm [119, 120] (see section D.5).

Figure 5.13: The synthetic CFM objective function, given a confidence parameter of $\psi = .05$.


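Clearly, then, the step size in (5.89) and (5.91) is proportional to the first derivative of the CFM
objective function

$$\Delta\theta[k] \propto \frac{1}{n} \sum_{j=1}^{n} \left( \frac{\partial}{\partial \delta_\tau} \sigma \left[ \delta_\tau(\mathbf{X}^j \mid \theta[k]), \psi \right] \, : \, \mathcal{W}^j = \omega_\tau \right) \tag{5.97}$$

Consequently, the learning rate is proportional to the first derivative of the CFM objective function.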

5.5.1 Differential Learning via the Synthetic Form of CFM is Reasonably Fast

As we mentioned earlier, all derivatives of the synthetic CFM objective function are zero for learned examples
(see (D.7)-(D.9) in section D.1). Thus, learning focuses on the un-learned and transition examples. Since
hard examples require learning with low confidence, we are especially concerned with the rate at which
these hard examples are learned, versus the rate at which transition examples are learned. Figure 5.13
illustrates the synthetic CFM objective function, given the confidence parameter $\psi = .05$ (recall that this
is the level of confidence required to learn the hard examples of the two-class random feature described
in section 5.3.6). Clearly, the first derivative of this function is quite large in the transition region (i.e.,
$\frac{\partial}{\partial \delta} \sigma[\delta \approx 0^+, \psi = .05] \cong 61$). However, it is considerably smaller for un-learned examples (i.e.,
$\frac{\partial}{\partial \delta} \sigma[\delta < 0, \psi = .05] \cong .02$). As a result, transition examples dominate the learning process when $\psi$
is small, so un-learned examples are learned slowly.

This phenomenon is a natural and unavoidable consequence of the necessary functional properties of the
CFM objective function. Thus, we are motivated to maximize the learning rate for un-learned examples as
$\psi$ becomes small, subject to the condition that CFM must simultaneously converge to a modified Heaviside
step function of $\delta$. Section D.3 discusses this objective at length, rigorously defining reasonably fast
and unreasonably slow learning in the process. Therein we denote the ratio of $\frac{\partial}{\partial \delta} \sigma[\delta, \psi]$ for transition
examples to $\frac{\partial}{\partial \delta} \sigma[\delta, \psi]$ for unlearned examples by $\Phi(\psi)$. If $\Phi(\psi)$ increases exponentially with decreasing
$\psi$, learning becomes dominated by the transition examples for small $\psi$: the classifier's parameters are
updated to transform the transition examples into learned examples, while the un-learned examples are
ignored (because the derivatives they elicit are so small in comparison to those of the transition examples).
Under these circumstances, it takes an unreasonably long time (e.g., [58, pp. 155-158]) to learn the yet
un-learned training examples, and we characterize the (differential) learning strategy as unreasonably slow.
If, on the other hand, $\Phi(\psi)$ increases polynomially with decreasing $\psi$, the learning strategy is reasonably
fast.

Section D.3.2 proves that differential learning via the synthetic CFM objective function described in
section D.1 is reasonably fast: $\Phi(\psi)$ grows only polynomially as $\psi$ decreases (it is $O$ of a negative
power of $\psi$). This characteristic allows hard examples to be learned in reasonable time, an assertion that
is implicitly illustrated throughout the experiments of part II. Synthetic CFM has the additional property of
being synthesized from three linear functions of $\delta$ connected by two circular arcs (see section D.1).
Consequently, all derivatives of order $\geq 2$ are zero for most values of $\delta$ (see the footnote below).
As $\psi$ becomes small, the synthetic function becomes approximately piece-wise linear in $\delta$ (see figure 5.13,
for example). Thus, if $\delta_{\tau}(X \mid \theta)$ is linear in $\theta$ (as it is, given a linear hypothesis
class), learning is very fast, even for hard examples. This is because (5.91) is a good approximation to the
ideal equation for $\theta[k+1]$ implied by (5.87). Specifically, let the matrix $A$ satisfy


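$$[\nabla_{\theta}(\mathrm{CFM}(S^{n} \mid \theta[k]))]^{T} A = \mathbf{1},$$ (5.98)

where $\mathbf{1}$ denotes the identity vector. Owing to the approximately piece-wise linear nature of
$\sigma[\delta, \psi]$ and the choice of a linear hypothesis class (such that $\delta_{\tau}(X \mid \theta)$ is
linear in $\theta$), (5.87) reduces to

$$\mathrm{CFM}(S^{n} \mid \theta[k+1]) \cong \mathrm{CFM}(S^{n} \mid \theta[k]) + [\nabla_{\theta}(\mathrm{CFM}(S^{n} \mid \theta[k]))]^{T} (\theta[k+1] - \theta[k])$$ (5.99)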


Thus,

$$A\,[\mathrm{CFM}(S^{n} \mid \theta[k+1]) - \mathrm{CFM}(S^{n} \mid \theta[k])] \cong A\,[\nabla_{\theta}(\mathrm{CFM}(S^{n} \mid \theta[k]))]^{T} (\theta[k+1] - \theta[k])$$

$$\text{s.t.} \quad \theta[k+1] \cong \theta[k] + \underbrace{A\,[\mathrm{CFM}(S^{n} \mid \theta[k+1]) - \mathrm{CFM}(S^{n} \mid \theta[k])]}_{\Delta\theta[k]}$$ (5.100)

Footnote: That is, all higher-order derivatives are zero for values of $\delta$ corresponding to the linear
segments of the synthetic function; higher-order derivatives corresponding to the arc segments of the synthetic
function are non-zero. As $\psi$ becomes small, the arc segments of the synthetic function also become small;
synthetic CFM becomes approximately piece-wise linear, and all its higher-order derivatives are zero.

Under these conditions, for which both synthetic CFM and the discriminator are linear functions of $\theta$,
differentially generated linear classifiers exhibit very fast learning. Sections 7.7 and 8.4.2 illustrate this
phenomenon.


5.5.2 Differential Learning via the Original Forms of CFM is Unreasonably Slow
and/or Inefficient

We were motivated to develop the synthetic form of CFM because the original functional forms described in
[55] induce either unreasonably slow or inefficient learning. Figure 5.14 illustrates these functional forms.
The original functional form of CFM was the logistic sigmoidal form on the left, given by

$$\sigma[\delta] = \alpha\,[1 + \exp(-\beta \cdot \delta + \zeta)]^{-1},$$ (5.101)

where $\alpha$ is a superfluous linear scaling factor, $\zeta$ is a parameter that shifts the sigmoid along the
$\delta$ axis, and $\beta$ plays a role roughly equivalent to the synthetic form's confidence parameter $\psi$
(large $\beta$ corresponds to small $\psi$). Section D.3.1 proves that $\Phi(\beta)$, the ratio of the function's
derivative for transition examples to its derivative for un-learned examples, is

$$\Phi(\beta) = O[\exp(|\delta|\,\beta)], \quad \beta \gg 1, \; \delta < 0$$ (5.102)

Thus, $\Phi(\beta)$ increases exponentially as $\beta$ is increased (i.e., as the function's equivalent of the
confidence parameter goes to zero). Under these circumstances, transition examples dominate the learning
process, and un-learned examples are effectively ignored. The irony here is that $\beta$ must be increased in
order for the hard examples to be learned, but increasing $\beta$ also ensures that it will take an unreasonably
long time to learn the hard examples. In short, differential learning via the original logistic sigmoidal form
of CFM is unreasonably slow, failing to learn hard examples in reasonable time (e.g., [58, pp. 155-158]).
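
The exponential growth of the derivative ratio can be checked numerically. The following is a minimal sketch, not the analysis of section D.3.1: it assumes the logistic form of (5.101) with $\alpha = 1$ and $\zeta = 0$, takes $\delta = 0$ as the representative transition example, and uses a hypothetical un-learned example at $\delta = -0.5$. The function names are ours.

```python
import math

def logistic_cfm_derivative(delta, beta, alpha=1.0, zeta=0.0):
    """Derivative with respect to delta of alpha / (1 + exp(-beta*delta + zeta))."""
    s = 1.0 / (1.0 + math.exp(-beta * delta + zeta))
    return alpha * beta * s * (1.0 - s)

def derivative_ratio(beta, delta_unlearned=-0.5):
    """Derivative at the transition point (delta = 0) divided by the derivative
    at an un-learned example (delta < 0). The value -0.5 is an illustrative choice."""
    transition = logistic_cfm_derivative(0.0, beta)
    unlearned = logistic_cfm_derivative(delta_unlearned, beta)
    return transition / unlearned

# The ratio grows roughly like exp(beta * |delta|) / 4 as beta increases.
for beta in (5, 10, 20, 40):
    print(beta, derivative_ratio(beta))
```

With $\zeta = 0$ the ratio has the closed form $\cosh^{2}(\beta|\delta|/2)$, which is consistent with the $O[\exp(|\delta|\beta)]$ behavior stated in (5.102).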

The "maximally flat" form on the right of figure 5.14 is given by

$$\sigma[\delta] = -\alpha \log[1 + (\zeta - \delta)^{\beta}]$$ (5.103)

where $\alpha$, $\zeta$, and $\beta$ have the same interpretations as they do for the logistic sigmoidal form. It
was developed in order to improve the learning rate for hard examples. Unfortunately, this form of CFM does
not converge to a modified Heaviside step function as $\beta \to \infty$ (i.e., as its equivalent of the confidence
parameter goes to zero), a necessary characteristic if the objective function is to be monotonic and if it is
to induce asymptotically efficient learning. Thus, differential learning via the maximally flat form of CFM is
reasonably fast, but provably inefficient.

Figure 5.14: Old forms of the CFM objective function, described in [55]. Left: The logistic sigmoidal form
satisfies the conditions for monotonicity, but leads to unreasonably slow learning. Right: The "maximally
flat" form induces reasonably fast learning, but fails to satisfy the conditions for monotonicity.

5.6  Summary 

The  link  between  an  objective  function’s  monotonicity  and  the  efficiency  of  the  learning  strategy  it  implements 
is  intuitively  appealing.  The  CFM  objective  function  induces  asymptotically  efficient  learning  —  regardless 
of  the  choice  of  hypothesis  class  —  precisely  because  it  is  monotonic.  Error  measure  objective  functions 
induce inefficient learning (see the footnote below) because they are non-monotonic. In fact, the monotonic
fraction of discriminator output space is a good indicator of how efficient the resulting learning strategy
will be, given a particular objective function, the number of classes $C$, and a hypothesis class that is
assumed to be an improper parametric model of the feature vector. The experiments of part II bear this out.
Although we do not quote the monotonic fractions explicitly, it is straightforward to show that the less
monotonic objective functions induce less efficient learning when the hypothesis class is improper. When the
hypothesis class is a proper parametric model of the feature vector, the non-monotonic nature of some error
measures (i.e., those associated with maximum-likelihood learning) is offset by the parametric model's
functional properties. These properties constrain the discriminator's correct output states to lie in the
monotonic region of "correct space".

Footnote: ...except, of course, for the case in which the hypothesis class is the proper parametric model of
the feature vector and the error measure is associated with maximum-likelihood learning.

In order to be truly monotonic, the objective function must under all circumstances be a strictly increasing
or strictly decreasing function of the classifier's empirical training sample error rate. When the learning task
includes hard examples, the CFM objective function must approach a limiting step functional form in order
to be monotonic (and induce efficient learning). This requirement, in turn, raises the issue of the learning
rate. The synthetic form of CFM detailed in appendix D allows hard examples to be learned in reasonable
time. Consequently, the synthetic form is superior to the original functional forms described in [55], which
induce unreasonably slow and/or inefficient learning. The difference between reasonably fast learning and
unreasonably slow learning is not to be underestimated in the case of hard examples. Many of the results in
part II would be worse by a statistically significant margin if hard examples were not learnable in reasonable
time via synthetic CFM.


Chapter  6 

An  Information-Theoretic  View  of 
Stochastic Concept Learning¹


Outline

We  make  a  distinction  between  the  probabilistic  information  content  and  the  discriminant  information  content 
of  a  randomly-drawn  training  sample,  the  former  being  associated  with  probabilistic  learning,  and  the  latter 
being  associated  with  differential  learning.  We  show  that  a  simple  unfair  (or  “rigged”)  game  of  dice  forms 
the basis of all learning/statistical pattern recognition tasks. We analyze this game in order to prove that
the  discriminant  information  contained  in  a  training  sample  is  always  at  least  as  great  as  the  probabilistic 
information  contained  therein.  The  information-theoretic  argument  relies  on  Rissanen’s  notion  of  stochastic 
complexity  (e.g.,  [115])  and  can  be  viewed  as  an  extension  of  the  chapter  3  proofs  that  differential  learning 
is 1) asymptotically efficient, and 2) requires the least functional complexity necessary to generate a
Bayes-optimal classifier. We derive tight, distribution-dependent lower bounds on the functional complexity and
training sample size necessary for "winning" the dice game via the differential and probabilistic learning
strategies.  The  differential  learning  strategy’s  functional  complexity  and  sample  size  requirements  are  usually 
less  (and  never  more)  than  the  probabilistic  learning  strategy’s.  We  show  how  simple  extensions  of  the  single 
die  paradigm  can  lead  to  analogous  lower  bounds  on  the  hypothesis  class’s  functional  complexity  and  the 
training sample size necessary for good generalization in learning/pattern recognition tasks. We conclude by
discussing the limitations of this generalization to the uncountable feature vector space.

6.1  Introduction 

The  essence  of  this  chapter  lies  in  the  analysis  of  a  rigged  (or  unfair)  game  of  dice  that,  in  turn,  lies  at  the 
heart of all stochastic concept learning/statistical pattern recognition tasks. Specifically, we have a $C$-sided
die. Each face of the die has some probability of turning up when the die is cast (i.e., tossed). One face is
always more likely to turn up than any of the others; thus, all the face probabilities may be different, but at
most $C - 1$ of them (the lesser probabilities) can be identical. The objective of the game is to identify
the most likely face with specified high confidence by observing a sequence of independent casts of the die.

¹This chapter is a revised version of work first published in [51].

The  relationship  between  this  rigged  die  paradigm  and  learning  stochastic  concepts  for  statistical  pattern 
recognition  becomes  clear  if  one  realizes  that  a  single  unfair  die  is  analogous  to  a  specific  point  on  the 
domain  of  the  random  feature  vector  being  classified.  Just  as  there  are  specific  a  posteriori  class  probabilities 
associated  with  a  point  in  feature  vector  space,  a  die  has  specific  probabilities  associated  with  each  of  its 
faces. The number of faces on the die ($C$) equals the number of classes associated with the analogous
point in feature vector space. Identifying the most likely die face is equivalent to identifying the largest a
posteriori  class  probability  for  the  analogous  point  in  feature  vector  space  —  the  requirement  for  Bayesian 
discrimination,  as  described  in  chapter  2. 

We  begin  by  defining  two  measures  of  functional  complexity,  based  on  Rissanen’s  definitions  of  stochastic 
complexity and minimum description length [112, 113, 114, 115]. Probabilistic complexity (definition 6.1)
is  the  stochastic  complexity  measure  associated  with  probabilistic  learning,  whereas  differential  complexity 
(definition  6.2)  is  the  stochastic  complexity  measure  associated  with  differential  learning.  The  relationship 
between  probabilistic  and  differential  complexity  parallels  the  relationship  between  the  strictly  probabilistic 
and differential forms of the Bayesian discriminant function described in section 2.2.1.

We  analyze  the  rigged  game  of  dice,  proving  that  it  requires  only  one  bit  of  differential  complexity  to 
learn  the  identity  of  the  most  likely  die  face  differentially  from  a  sequence  of  independent  die  casts;  in 
contrast,  it  typically  requires  more  (and  never  requires  less)  differential  complexity  to  learn  the  identity  of  the 
most  likely  die  face  probabilistically  via  the  same  sequence  of  independent  die  casts.  Moreover,  the  identity 
of the most likely die face becomes empirically evident with far fewer casts of the die than are required to
estimate  the  die’s  face  probabilities  with  specified  precision.  In  more  formal  terms  associated  with  learning 
for  statistical  pattern  recognition,  the  discriminant  information  content  of  a  sequence  of  independent  die  casts 
is  usually  higher  (and  never  less)  than  its  probabilistic  information  content.  These  information-theoretic 
proofs  are  analogous  to  the  estimation-theoretic  proofs  of  chapter  3  that  differential  learning  is  asymptotically 
efficient  (Theorem  3.1),  requiring  the  hypothesis  class  with  the  least  functional  complexity  necessary  for 
Bayesian  discrimination  (Corollary  3.1). 

Following these arguments, we formulate tight, distribution-dependent lower bounds on the functional
complexity and the number of casts necessary to identify the most likely die face empirically with high
confidence. We show how simple extensions of the single die paradigm can lead to analogous lower bounds on
the hypothesis class's functional complexity and the training sample size requirements for good generalization
in learning/pattern recognition tasks. We conclude by describing the limitations of the generalization to the
uncountable feature vector space: since the rigged die paradigm is fundamentally discrete, it over-estimates
(perhaps by orders of magnitude) the training sample size requirements it predicts are necessary for
generalizing well with uncountable feature vectors.

6.2 Probabilistic versus Differential Complexity


Axiom 6.1 We view the number of bits in the finite-precision approximation $q_{M_q}[x]$ to the real number
$x \in (-1, 1]$ as a measure of the approximation's functional complexity. That is, the functional complexity of
the approximation is the number of bits with which it represents the real number.

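We compute the $M_q$-bit approximation $q_{M_q}[z]$ to the real number $z \in (-1, 1]$ using the following
fixed-point representation:

$$\begin{aligned}
\text{MSB (most significant bit)} &= \mathrm{sign}[z] \\
\text{MSB} - 1 &= 2^{-1} \\
&\;\;\vdots \\
\text{LSB (least significant bit)} &= 2^{-(M_q - 1)}
\end{aligned}$$ (6.1)

The specific value of $q_{M_q}[z]$ is the mid-point of the $2^{-(M_q - 1)}$-wide half-open interval on which $z$
is located:²

$$q_{M_q}[z] = \begin{cases}
\mathrm{sign}[z] \cdot \left( \lfloor |z| \cdot 2^{(M_q - 1)} \rfloor \cdot 2^{-(M_q - 1)} + 2^{-M_q} \right), & |z| < 1 \\
\mathrm{sign}[z] \cdot \left( 1 - 2^{-M_q} \right), & |z| \geq 1
\end{cases}$$ (6.2)

The lower and upper bounds on the quantization interval are $L_{M_q}[z]$ and $U_{M_q}[z]$, such that

$$L_{M_q}[z] < z \leq U_{M_q}[z]$$ (6.3)

$$L_{M_q}[z] = q_{M_q}[z] - 2^{-M_q}$$ (6.4)

$$U_{M_q}[z] = q_{M_q}[z] + 2^{-M_q}$$ (6.5)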


The fixed-point representation described by (6.1) through (6.5) differs from standard fixed-point
representations in its choice of quantization interval. The choice of (6.2) through (6.5) represents zero as a
non-positive, finite precision number (i.e., by (6.2), $q_{M_q}[0] = -2^{-M_q}$).
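
The quantizer of (6.2) and its interval bounds can be sketched in a few lines. This is a minimal illustration rather than a definitive implementation: the function names are ours, and we assume that sign[0] = -1, which is what the stated value of the approximation to zero requires.

```python
import math

def sign(z):
    # Assumption: sign[0] = -1, so that zero quantizes to -2**(-m) as the text states.
    return 1.0 if z > 0 else -1.0

def quantize(z, m):
    """m-bit mid-point quantization of z in (-1, 1], following (6.2)."""
    if not (-1.0 < z <= 1.0):
        raise ValueError("z must lie in (-1, 1]")
    if abs(z) < 1.0:
        magnitude = math.floor(abs(z) * 2 ** (m - 1)) * 2.0 ** -(m - 1)
        return sign(z) * (magnitude + 2.0 ** -m)
    return sign(z) * (1.0 - 2.0 ** -m)

def bounds(z, m):
    """Lower and upper bounds of the quantization interval, following (6.4) and (6.5)."""
    q = quantize(z, m)
    return q - 2.0 ** -m, q + 2.0 ** -m

print(quantize(0.0, 3))   # zero maps to a non-positive value
print(quantize(0.3, 3), bounds(0.3, 3))
```

With one bit (m = 1) every positive input maps to 0.5 with bounds (0, 1), so the single bit carries only the sign of the input.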

²The notation $\lfloor r \rfloor$ denotes the largest decimal integer not greater than $r$, and the notation
$\lceil r \rceil$ denotes the smallest decimal integer not less than $r$ (e.g., [75]).

Note that if

$$z_1 = z_0 + 2^{-(M_q - 1)}$$ (6.6)

(i.e., if $z_0$ and $z_1$ are in adjacent quantization intervals for $M_q$-bit quantization) then


(6.7) 

Given this information-theoretic measure of functional complexity, we define two functional complexity
measures: one is for the a posteriori class probabilities $\{P_{W|x}(\omega_1 \mid X), \ldots, P_{W|x}(\omega_C \mid X)\}$
of the $C$-class random feature vector $X$; the other is for the a posteriori class differentials
$\{\Delta_{W|x}(\omega_1 \mid X), \ldots, \Delta_{W|x}(\omega_C \mid X)\}$ of $X$. Both definitions are based on
Rissanen's notion of stochastic complexity and minimum description length [112, 113, 114, 115].

Definition 6.1 Probabilistic Complexity $M_{qp}$: The probabilistic complexity of an approximation to the
a posteriori class probability $P_{W|x}(\omega_i \mid X)$, lying on the half-open interval $(l, u]$, is the minimum
number of bits $M_{qp}$ necessary to ensure that $q_{M_{qp}}[P_{W|x}(\omega_i \mid X)]$ lies between the lower and
upper bounds $l$ and $u$:

$$l \leq L_{M_{qp}}[P_{W|x}(\omega_i \mid X)] < P_{W|x}(\omega_i \mid X) \leq U_{M_{qp}}[P_{W|x}(\omega_i \mid X)] \leq u$$

$$l < u, \quad 0 \leq l < 1, \quad 0 < u \leq 1$$ (6.8)


Remark:  Thus,  our  definition  of  probabilistic  complexity  is  identical  to  Rissanen’s  definition  of  stochastic 
complexity,  applied  to  approximating  the  feature  vector’s  a  posteriori  class  probabilities. 

Definition 6.2 Differential Complexity $M_{q\Delta}$: The differential complexity of an approximation to the a
posteriori class differential $\Delta_{W|x}(\omega_i \mid X)$, lying on the half-open interval $(l', u']$, is the
minimum number of bits $M_{q\Delta}$ necessary to ensure that $q_{M_{q\Delta}}[\Delta_{W|x}(\omega_i \mid X)]$ lies
between the lower and upper bounds $l'$ and $u'$:

$$l' \leq L_{M_{q\Delta}}[\Delta_{W|x}(\omega_i \mid X)] < \Delta_{W|x}(\omega_i \mid X) \leq U_{M_{q\Delta}}[\Delta_{W|x}(\omega_i \mid X)] \leq u'$$

$$l' < u', \quad -1 \leq l' < 1, \quad -1 < u' \leq 1$$ (6.9)


Remark:  Our  definition  of  differential  complexity  is  consistent  with  Rissanen’s  definition  of  stochastic 
complexity,  but  it  focuses  on  approximating  the  feature  vector’s  a  posteriori  class  differentials  rather  than  its 
a  posteriori  class  probabilities. 
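
The two definitions can be compared on a small example. The sketch below searches for the smallest number of bits whose quantization interval fits inside a target interval; it assumes the mid-point quantizer of section 6.2 with sign[0] = -1, and the helper names and the numerical targets are hypothetical choices of ours.

```python
import math

def interval(z, m):
    """Quantization interval (lower, upper) of the m-bit mid-point approximation to z."""
    sgn = 1.0 if z > 0 else -1.0  # assumption: sign[0] = -1
    if abs(z) < 1.0:
        q = sgn * (math.floor(abs(z) * 2 ** (m - 1)) * 2.0 ** -(m - 1) + 2.0 ** -m)
    else:
        q = sgn * (1.0 - 2.0 ** -m)
    return q - 2.0 ** -m, q + 2.0 ** -m

def complexity(value, lower, upper, max_bits=32):
    """Smallest m for which the quantization interval of value lies within [lower, upper]."""
    for m in range(1, max_bits + 1):
        lo, hi = interval(value, m)
        if lower <= lo and hi <= upper:
            return m
    return None  # not achievable within max_bits

# Sign of a differential of 0.2: the target interval is (0, 1].
print(complexity(0.2, 0.0, 1.0))
# A probability of 0.6 that must be localized to (0.5, 0.75].
print(complexity(0.6, 0.5, 0.75))
```

In this example one bit suffices to fix the sign of the differential, whereas localizing the probability to the narrower interval takes three bits.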

The relationship between probabilistic and differential complexity is formalized in section 6.3.3.


6.3 Exploring the Curious Relationship Between Winning a Rigged
Game of Dice and Building an Efficient Classifier

Consider the $C$-sided die $X$ with face probabilities $\{P_{W|x}(\omega_1 \mid X), \ldots, P_{W|x}(\omega_C \mid X)\}$,
which sum to one: $P_{W|x}(\omega_i \mid X)$ is the probability that the $i$th face $\omega_i$ will turn up on any
given cast of the die. We assume that one die face is more likely than all the others. Using the notational
conventions of section 2.4, we denote the probability of this most likely face by

$$P_{W|x}(\omega_{(1)} \mid X) = \max_{k} P_{W|x}(\omega_k \mid X)$$ (6.10)

More generally, $P_{W|x}(\omega_{(j)} \mid X)$ denotes the probability of the $j$th most likely face $\omega_{(j)}$,
whereas $P_{W|x}(\omega_j \mid X)$ merely denotes the probability of the face with the randomly-assigned
identifying index $j$ (i.e., $\omega_j$).

Notational Conventions for Empirical Estimates of Probabilities: Given $n$ casts of the die in which the
$i$th face turns up $k_i$ times, we employ the following notational conventions:

The integer $k_i$ denotes the number of occurrences of face $\omega_i$, such that

$$\hat{P}_{W|x}(\omega_i \mid X) = \frac{k_i}{n}$$

denotes the empirical estimate of $P_{W|x}(\omega_i \mid X)$. The integer $k_{(i)}$ denotes the number of
occurrences of the $i$th most likely face $\omega_{(i)}$, such that

$$\hat{P}_{W|x}(\omega_{(i)} \mid X) = \frac{k_{(i)}}{n}$$

denotes the empirical estimate of $P_{W|x}(\omega_{(i)} \mid X)$. The integer $k_{\sim(i)}$ denotes the number of
occurrences of the $i$th empirically most likely face $\omega_{\sim(i)}$, such that

$$\hat{P}_{W|x}(\omega_{\sim(i)} \mid X) = \frac{k_{\sim(i)}}{n}$$

denotes the $i$th empirically-ranked face probability estimate. Note that the $i$th empirically most likely die
face $\omega_{\sim(i)}$ is not necessarily the same face as the true $i$th most likely die face $\omega_{(i)}$ for a
finite number of die casts $n$.

The empirical estimates of the face probabilities are multinomially distributed [33], such that for $n$ casts of
the die, the probability that there will be exactly $k_i$ occurrences of the $i$th face is

$$P\left(n \cdot \hat{P}_{W|x}(\omega_1 \mid X) = k_1,\; n \cdot \hat{P}_{W|x}(\omega_2 \mid X) = k_2,\; \ldots,\; n \cdot \hat{P}_{W|x}(\omega_C \mid X) = k_C\right) = n! \prod_{i=1}^{C} \frac{P_{W|x}(\omega_i \mid X)^{k_i}}{k_i!}$$ (6.11)

where

$$\sum_{i=1}^{C} k_i = n$$ (6.12)

The question we would like to answer is, "How many casts of the die must occur before the most likely die
face $\omega_{(1)}$ becomes empirically evident with probability $\alpha$?" Given $n$ tosses, we can identify the
most likely die face by first estimating the probability of each face

$$\hat{P}_{W|x}(\omega_i \mid X) = \frac{k_i}{n};$$ (6.13)

we then rank these estimates and choose the empirically most likely face $\omega_{\sim(1)}$ (i.e., the one with the
largest estimated probability $\hat{P}_{W|x}(\omega_{\sim(1)} \mid X)$) as our estimate for $\omega_{(1)}$. We refer
to this strategy as the probabilistic strategy for learning the most likely die face through empirical
observation. Another way to identify the most likely die face is to estimate the discriminant differential

$$\delta_i(X) = \hat{P}_{W|x}(\omega_i \mid X) - \max_{j \neq i} \hat{P}_{W|x}(\omega_j \mid X) = \frac{k_i}{n} - \max_{j \neq i} \frac{k_j}{n}$$ (6.14)

for each of the $C$ die faces. Only one of these (i.e., $\delta_{\sim(1)}(X)$) will be positive, and it will be
associated with the empirically most likely die face $\omega_{\sim(1)}$, our estimate for $\omega_{(1)}$. We refer
to this strategy as the differential strategy for learning the most likely die face through empirical
observation. We analyze the differential strategy first and follow with an analysis and comparison of the
probabilistic strategy.
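
The two strategies can be sketched on simulated casts as follows. This is an illustration under our own assumptions: the face probabilities, the seed, and all function names are hypothetical, and faces are indexed from 0.

```python
import random
from collections import Counter

def cast_die(face_probs, n, rng):
    """Simulate n independent casts; returns a Counter of face index -> occurrences."""
    faces = list(range(len(face_probs)))
    return Counter(rng.choices(faces, weights=face_probs, k=n))

def probabilistic_strategy(counts, num_faces):
    """Estimate every face probability k_i / n, then pick the largest estimate."""
    n = sum(counts.values())
    estimates = [counts[i] / n for i in range(num_faces)]
    winner = max(range(num_faces), key=estimates.__getitem__)
    return winner, estimates

def differential_strategy(counts, num_faces):
    """Estimate k_i / n - max_{j != i} k_j / n for each face, then pick the positive one."""
    n = sum(counts.values())
    diffs = []
    for i in range(num_faces):
        best_other = max(counts[j] for j in range(num_faces) if j != i)
        diffs.append((counts[i] - best_other) / n)
    winner = max(range(num_faces), key=diffs.__getitem__)
    return winner, diffs

rng = random.Random(0)                         # fixed seed, for repeatability only
counts = cast_die([0.5, 0.3, 0.2], 2000, rng)  # hypothetical rigged three-sided die
print(probabilistic_strategy(counts, 3))
print(differential_strategy(counts, 3))
```

Both functions select the same face, since both reduce to finding the largest count; the difference lies in what must be represented. The differential strategy needs only the signs of its outputs, whereas the probabilistic strategy carries the estimates themselves.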


6.3.1 The Differential Mechanism by which the Most Likely Die Face Becomes
Empirically Evident

Consider the $C$ a posteriori class differentials, originally defined in (2.13):

$$\Delta_{W|x}(\omega_i \mid X) = P_{W|x}(\omega_i \mid X) - \max_{j \neq i} P_{W|x}(\omega_j \mid X), \quad i = 1, \ldots, C$$ (6.15)

Note that when $i = (1)$

$$\Delta_{W|x}(\omega_{(1)} \mid X) = P_{W|x}(\omega_{(1)} \mid X) - P_{W|x}(\omega_{(2)} \mid X),$$ (6.16)

and when $i \neq (1)$

$$\Delta_{W|x}(\omega_{(i)} \mid X) = P_{W|x}(\omega_{(i)} \mid X) - P_{W|x}(\omega_{(1)} \mid X) \quad \forall i \geq 2$$ (6.17)

Note also, by (6.15),

$$\Delta_{W|x}(\omega_{(2)} \mid X) = -\Delta_{W|x}(\omega_{(1)} \mid X)$$ (6.18)

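Thus,

$$\sum_{i=1}^{C} \Delta_{W|x}(\omega_i \mid X) = \sum_{i=(3)}^{(C)} P_{W|x}(\omega_i \mid X) - (C - 2) \cdot P_{W|x}(\omega_{(1)} \mid X),$$ (6.19)

and we can use the relationship $\sum_{i=1}^{C} P_{W|x}(\omega_i \mid X) = 1$ to show that the $C$ a posteriori
class differentials of (6.15) yield the $C$ die face probabilities as follows:

$$P_{W|x}(\omega_{(1)} \mid X) = \frac{1}{C} \left[ 1 - \sum_{i=(2)}^{(C)} \Delta_{W|x}(\omega_i \mid X) \right]$$

$$P_{W|x}(\omega_{(j)} \mid X) = \Delta_{W|x}(\omega_{(j)} \mid X) + P_{W|x}(\omega_{(1)} \mid X) \quad \forall j \geq 2$$ (6.20)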
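
The correspondence between probabilities and differentials can be verified numerically. The sketch below computes the differentials of (6.15) and then inverts them under our reading of (6.20), in which the top probability is one minus the sum of the remaining differentials, divided by $C$. The function names and the example probabilities are ours.

```python
def class_differentials(probs):
    """Differential of each face: its probability minus the largest other probability."""
    return [p - max(q for j, q in enumerate(probs) if j != i)
            for i, p in enumerate(probs)]

def probabilities_from_differentials(diffs):
    """Recover the face probabilities from the differentials (faces in any order)."""
    c = len(diffs)
    top = max(range(c), key=diffs.__getitem__)  # the only positive differential
    p_top = (1.0 - sum(d for j, d in enumerate(diffs) if j != top)) / c
    return [p_top if j == top else diffs[j] + p_top for j in range(c)]

probs = [0.5, 0.3, 0.2]            # hypothetical rigged three-sided die
diffs = class_differentials(probs)
print(diffs)
print(probabilities_from_differentials(diffs))
```

The round trip recovers the original probabilities up to floating-point error, which illustrates why high-precision estimates of either set of quantities carry the same information.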


From this perspective, estimating the $C$ a posteriori class differentials with high precision is equivalent
to estimating the $C$ die face probabilities with high precision, and vice-versa. However, since we need only
know the signs of the a posteriori class differentials to identify the most likely die face

$$\Delta_{W|x}(\omega_{(1)} \mid X) > 0$$

$$\Delta_{W|x}(\omega_{(j)} \mid X) < 0 \quad \forall j \geq 2,$$ (6.21)


we need only estimate each a posteriori class differential to one (sign) bit precision in order to identify
the most likely die face. Specifically, in order to correctly identify the most likely die face $\omega_{(1)}$ with
probability at least $\alpha$, using the discriminant differentials $\{\delta_{\sim(1)}(X), \ldots, \delta_{\sim(C)}(X)\}$,
we need only ensure that the die is cast enough times so that the identity of the empirically most likely die
face is that of the true most likely die face (i.e., $\omega_{\sim(1)} \to \omega_{(1)}$) with probability at least
$\alpha$. If this is the case,³ the signs of the discriminant differentials $\{\delta_{(1)}(X), \ldots, \delta_{(C)}(X)\}$
match the signs of their corresponding a posteriori class differentials
$\{\Delta_{W|x}(\omega_{(1)} \mid X), \ldots, \Delta_{W|x}(\omega_{(C)} \mid X)\}$ (recall that this is the condition
of (2.17) necessary for Bayesian discrimination):

$$P\left(\mathrm{sign}[\delta_{(i)}(X)] = \mathrm{sign}[\Delta_{W|x}(\omega_{(i)} \mid X)] \;\; \forall i\right) = P\left(\begin{array}{l} \delta_{\sim(1)}(X) = \delta_{(1)}(X) > 0 \\ \delta_{(j)}(X) < 0 \;\; \forall j \geq 2 \end{array}\right) \geq \alpha$$ (6.22)

³The notational conventions for the discriminant differentials follow those for the face probabilities described
in the note on notational conventions at the start of section 6.3. Thus, $\delta_j(X)$ is associated with
$\omega_j$, $\delta_{(j)}(X)$ is associated with $\omega_{(j)}$, and $\delta_{\sim(j)}(X)$ is associated with
$\omega_{\sim(j)}$.


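When (6.22) holds, the largest empirical face probability reflects the identity of the true most likely die face:

$$P\left(\hat{P}_{W|x}(\omega_{\sim(1)} \mid X) = \hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(j)} \mid X) \;\; \forall j \geq 2\right) \geq \alpha$$ (6.23)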

The  computation  to  determine  the  smallest  number  of  die  casts  n  for  which  (6.22)  and  (6.23)  are  satisfied  is 
intractable  for  all  but  small  values  of  C  and  n .  For  this  reason,  we  make  the  following  assumption. 


Assumption 6.1 We assume that the empirical estimate of the most likely die face's probability is greater
than the empirical estimates of all other die face probabilities if it is greater than the empirical estimate of
the second most likely die face's probability. Mathematically,

We assume that if $\hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(2)} \mid X)$

then $\hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(j)} \mid X) \;\; \forall j > 2$ (6.24)

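Under this assumption, the multinomial expansion implied by (6.22) and (6.23) can be simplified as
follows. Note: we use the short-hand notation $P(\omega_j)$ to denote $P_{W|x}(\omega_j \mid X)$ and
$\hat{P}(\omega_j)$ to denote $\hat{P}_{W|x}(\omega_j \mid X)$; this makes the following formula more readable.

$$P\left(\hat{P}(\omega_{(1)}) > \hat{P}(\omega_{(j)}) \;\; \forall j \geq 2\right) = P\left(\delta_{(1)}(X) > 0,\; \delta_{(j)}(X) < 0 \;\; \forall j \geq 2\right)$$ (6.25)

$$\cong P\left(\hat{P}(\omega_{(1)}) > \hat{P}(\omega_{(2)})\right)$$ (6.26)

$$= \sum_{k_{(1)}} P\left(n \cdot \hat{P}(\omega_{(1)}) = k_{(1)},\; n \cdot \hat{P}(\omega_{(2)}) < k_{(1)}\right)$$

$$= n! \sum_{k_{(1)} = \lambda_1}^{\upsilon_1} \frac{P(\omega_{(1)})^{k_{(1)}}}{k_{(1)}!} \sum_{k_{(2)} = \lambda_2}^{\upsilon_2} \frac{P(\omega_{(2)})^{k_{(2)}} \left(1 - P(\omega_{(1)}) - P(\omega_{(2)})\right)^{(n - k_{(1)} - k_{(2)})}}{k_{(2)}! \, (n - k_{(1)} - k_{(2)})!} \geq \alpha$$ (6.27)

where⁴

$$\lambda_1 = \begin{cases} 1, & C = 2 \\ \lfloor n/C \rfloor + 1, & C > 2 \end{cases}$$ (6.28)

$$\upsilon_1 = n$$ (6.29)

$$\lambda_2 = 0$$ (6.30)

$$\upsilon_2 = \min\left(k_{(1)} - 1,\; n - k_{(1)}\right)$$ (6.31)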

The sufficiency of (6.25) through (6.27) as an indicator that we have correctly identified the most likely die
face rests on the validity of assumption 6.1, which allows the reduction of (6.25) to (6.26). The upper
bound on $k_{(2)}$ in (6.27), given by $\upsilon_2$ in (6.31), simply enforces the constraint of (6.12), namely,
that all the $k$s sum to $n$. The lower bound on $k_{(1)}$ in (6.27), given by $\lambda_1$ in (6.28), is related to
the reduction of (6.25) to (6.27). If $\lambda_1$ were 1, then (6.27) would be an exact expression of (6.26).
However, we really want (6.27) to be a good approximation to (6.25). A necessary and sufficient condition for
$\hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(j)} \mid X) \;\forall j \geq 2$ is that the most likely die
face occur more frequently than any other. Thus, a necessary (albeit insufficient) condition for
$\hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(j)} \mid X) \;\forall j \geq 2$ is

$$\sum_{j=3}^{C} k_{(j)} = n - k_{(1)} - k_{(2)} < (C - 2)\,k_{(1)} \quad \forall C > 2$$ (6.32)

such that

$$k_{(1)} > \frac{n}{C} \quad \forall C > 2$$ (6.33)

which is the constraint enforced by (6.28). Clearly there will be cases for which (6.33) holds, but our
assumption that $\hat{P}_{W|x}(\omega_{(1)} \mid X) > \hat{P}_{W|x}(\omega_{(j)} \mid X) \;\forall j > 2$ does not (i.e.,
(6.33) is a necessary but insufficient condition for assumption 6.1 to hold in all cases). Thus it is important
to qualify assumption 6.1 with the scenarios under which it might fail, yielding a higher-than-warranted
estimate of the probability in (6.25). We envisage two such scenarios: 1) the scenario in which $n$ is small,
and 2) the scenario in which $P_{W|x}(\omega_{(2)} \mid X), \ldots, P_{W|x}(\omega_{(C)} \mid X)$ are nominally equal.
Both scenarios can give rise to cases in which the empirical estimates of
$\{P_{W|x}(\omega_{(2)} \mid X), \ldots, P_{W|x}(\omega_{(C)} \mid X)\}$ do not reflect the rankings of the true
probabilities $\{P_{W|x}(\omega_{(2)} \mid X), \ldots, P_{W|x}(\omega_{(C)} \mid X)\}$.

The significance of the first scenario is diminished by the requirement that $P_{W|x}(\omega_{(1)} \mid X)$ be
considerably greater than $P_{W|x}(\omega_{(2)} \mid X)$ in order for their corresponding estimates to be ranked
appropriately with high probability for small $n$. This would suggest that our estimate of the probability in
(6.25) is not significantly affected by assumption 6.1.

⁴The operator $\max(a, b)$ returns the greater of $a$ and $b$. Likewise, the operator $\min(a, b)$ returns the
lesser of $a$ and $b$.

In the second scenario, if $P_{W|x}(\omega_{(1)} \mid X)$ is only marginally larger than
$P_{W|x}(\omega_{(2)} \mid X)$, then $n$ will have to be so large, in order to assure with high confidence that the
rankings of $\hat{P}_{W|x}(\omega_{(1)} \mid X)$ and $\hat{P}_{W|x}(\omega_{(2)} \mid X)$ are valid, that the effect of
the lesser-ranked probabilities on the approximation of (6.27) will be minimal. If on the other hand
$P_{W|x}(\omega_{(1)} \mid X)$ is significantly larger than $P_{W|x}(\omega_{(2)} \mid X)$, then this scenario becomes a
special case of the first scenario, and again we are reasonably safe in our assumption.

These mitigating factors notwithstanding, there are surely cases in which assumption 6.1 does not hold.
In  these  cases,  (6.27)  over-estimates  the  probability  that  the  most  likely  die  face  has  become  empirically 
evident  after  n  casts  of  the  die. 
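
The double sum of (6.27) is straightforward to evaluate for modest $n$. The sketch below is one way to do so, under stated assumptions: the lower limit on $k_{(1)}$ for $C > 2$ is taken to be $\lfloor n/C \rfloor + 1$ (our reading of (6.28) and (6.33)), the function names are ours, and the search cap `n_max` is an arbitrary choice. It requires Python 3.8 or later for `math.comb`.

```python
from math import comb

def prob_top_face_evident(p1, p2, n, num_faces):
    """Approximate probability that the most likely face is empirically evident
    after n casts, given the two largest face probabilities p1 > p2."""
    rest = max(0.0, 1.0 - p1 - p2)  # combined probability of the remaining faces
    lam1 = 1 if num_faces == 2 else n // num_faces + 1  # assumed lower limit on k1
    total = 0.0
    for k1 in range(lam1, n + 1):
        for k2 in range(0, min(k1 - 1, n - k1) + 1):
            k3 = n - k1 - k2
            coeff = comb(n, k1) * comb(n - k1, k2)  # n! / (k1! k2! k3!)
            total += coeff * p1 ** k1 * p2 ** k2 * rest ** k3
    return total

def min_casts(p1, p2, num_faces, alpha, n_max=200):
    """First n (up to n_max) at which the probability above reaches alpha."""
    for n in range(1, n_max + 1):
        if prob_top_face_evident(p1, p2, n, num_faces) >= alpha:
            return n
    return None

print(prob_top_face_evident(0.7, 0.3, 3, 2))
print(min_casts(0.7, 0.3, 2, 0.75))
```

For the two-sided example, three casts give a probability of 0.784 (two or more occurrences of the favored face), so three casts are the first to reach a confidence of 0.75. The probability is not monotonic in $n$ because ties are possible for even $n$, which is why the search returns the first $n$ that meets the threshold rather than a bound valid for all larger $n$.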

It is enlightening to recognize that the necessary condition for a reliable 1-bit approximation of
$\Delta_{W|x}(\omega_{(1)} \mid X)$ given in (6.25) is equivalent to the condition

$$P\left(0 < \delta_{(1)}(X) \leq 1\right) \geq \alpha$$ (6.34)

This condition is, in turn, a specific example of the more general necessary condition for a reliable
$M_{q\Delta}$-bit approximation $\delta_{(1)}(X)$ of $\Delta_{W|x}(\omega_{(1)} \mid X)$:

$$P\left(L_{M_{q\Delta}}[\Delta_{W|x}(\omega_{(1)} \mid X)] < \delta_{(1)}(X) \leq U_{M_{q\Delta}}[\Delta_{W|x}(\omega_{(1)} \mid X)]\right) \geq \alpha$$ (6.35)

where $L_{M_{q\Delta}}[\cdot]$ and $U_{M_{q\Delta}}[\cdot]$ are the lower and upper bounds on the $M_{q\Delta}$-bit
quantization interval for $q_{M_{q\Delta}}[\Delta_{W|x}(\omega_{(1)} \mid X)]$ (recall definition 6.2). Given a
positive integer $n$ and a real number $z \in (-1, 1]$, let $k_{L_{M_{q\Delta}}}[z]$ be the smallest value of $k$
for which $\frac{k}{n} > L_{M_{q\Delta}}[z]$, where $L_{M_{q\Delta}}[z]$ is given by (6.4); likewise, let
$k_{U_{M_{q\Delta}}}[z]$ be the largest value of $k$ for which $\frac{k}{n} \leq U_{M_{q\Delta}}[z]$, where
$U_{M_{q\Delta}}[z]$ is given by (6.5). More formally,

(6.36)

Then if $z_0$ and $z_1$ are in adjacent quantization intervals, with $z_1$ in the "upper" interval (as
formally specified by (6.6) and (6.7)), the following relationship holds:

$$k_{L_{M_{q\Delta}}}[z_1] = k_{U_{M_{q\Delta}}}[z_0] + 1$$ (6.37)

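With these relationships, (6.25) through (6.31) can be generalized as follows. Note: again, we use the short-hand
notation $P(\omega_j)$ to denote $P_{W|X}(\omega_j \mid X)$ and $\hat{P}(\omega_j)$ to denote $\hat{P}_{W|X}(\omega_j \mid X)$. Likewise, we use $\Delta(\omega_j)$
to denote $\Delta_{W|X}(\omega_j \mid X)$ in the interest of notational simplicity.

$$
\begin{aligned}
& P\bigl(L_{M_{q\Delta}}[\Delta(\omega_{(1)})] < \delta_{(1)}(X) \le U_{M_{q\Delta}}[\Delta(\omega_{(1)})],\; \delta_{(j)}(X) < 0 \;\; \forall j \ge 2\bigr) \\
&\quad = P\bigl(L_{M_{q\Delta}}[\Delta(\omega_{(1)})] < \delta_{(1)}(X) \le U_{M_{q\Delta}}[\Delta(\omega_{(1)})]\bigr) \qquad (6.38) \\
&\quad = P\bigl(0 < \delta_{(1)}(X) - L_{M_{q\Delta}}[\Delta(\omega_{(1)})] \le U_{M_{q\Delta}}[\Delta(\omega_{(1)})] - L_{M_{q\Delta}}[\Delta(\omega_{(1)})]\bigr) \\
&\quad = P\bigl(\hat{P}(\omega_{(2)}) < \hat{P}(\omega_{(1)}) - L_{M_{q\Delta}}[\Delta(\omega_{(1)})] \le \hat{P}(\omega_{(2)}) + U_{M_{q\Delta}}[\Delta(\omega_{(1)})] - L_{M_{q\Delta}}[\Delta(\omega_{(1)})]\bigr) \\
&\quad = P\bigl(\hat{P}(\omega_{(2)}) < \hat{P}(\omega_{(1)}) - L_{M_{q\Delta}}[\Delta(\omega_{(1)})],\; \hat{P}(\omega_{(2)}) \ge \hat{P}(\omega_{(1)}) - U_{M_{q\Delta}}[\Delta(\omega_{(1)})]\bigr) \\
&\quad \approx n! \sum_{k_{(1)} = \lambda_1}^{\nu_1} \frac{P(\omega_{(1)})^{k_{(1)}}}{k_{(1)}!} \sum_{k_{(2)} = \lambda_2}^{\nu_2} \frac{P(\omega_{(2)})^{k_{(2)}} \, \bigl(1 - P(\omega_{(1)}) - P(\omega_{(2)})\bigr)^{n - k_{(1)} - k_{(2)}}}{k_{(2)}! \, (n - k_{(1)} - k_{(2)})!} \qquad (6.39) \\
&\quad \ge \alpha
\end{aligned}
$$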

where 

(  C  -  2 

'  1  n>ax(*/*,^[A(a>d,)),2^  -1-  1)  .  C  >  2 

(6.40) 

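$$\nu_1 = n \qquad (6.41)$$

$$\lambda_2 = \max\bigl(0,\; k_{(1)} - k_{U_{M_{q\Delta}}[\Delta(\omega_{(1)})]}\bigr) \qquad (6.42)$$

$$\nu_2 = \min\bigl(k_{(1)} - k_{L_{M_{q\Delta}}[\Delta(\omega_{(1)})]},\; n - k_{(1)}\bigr) \qquad (6.43)$$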

The limits of (6.40), (6.41), and (6.43) correspond to those in (6.28), (6.29), and (6.31); the limit of (6.42) is
an additional one imposed by our desire for an $M_{q\Delta} > 1$ bit approximation, which places both lower and
upper bounds on the discriminant differential approximation $\delta_{(1)}(X)$.
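As a rough numerical sketch, the double sum of (6.39) can be evaluated directly once its four limits are known. The Python below is illustrative only: the function name `trinomial_sum` and its argument names are assumptions introduced here, not part of the original report. The inner limits are passed as callables because (6.42) and (6.43) depend on $k_{(1)}$. The usage lines assume the one-sign-bit case with $k_L = 1$ and $k_U = n$, and they take $\lambda_1 = 1$ purely for illustration.

```python
from math import factorial


def trinomial_sum(n, p1, p2, lam1, nu1, lam2, nu2):
    """Evaluate the double sum of (6.39) for caller-supplied limits.

    p1, p2   : probabilities of the two top-ranked faces
    lam1,nu1 : integer limits on k1 (outer sum)
    lam2,nu2 : callables k1 -> integer limits on k2 (inner sum)
    All remaining faces are lumped into one cell, as assumption 6.1 implies.
    """
    p_rest = 1.0 - p1 - p2
    total = 0.0
    for k1 in range(lam1, nu1 + 1):
        for k2 in range(lam2(k1), nu2(k1) + 1):
            k_rest = n - k1 - k2
            coeff = factorial(n) // (
                factorial(k1) * factorial(k2) * factorial(k_rest)
            )
            total += coeff * p1 ** k1 * p2 ** k2 * p_rest ** k_rest
    return total


# Hypothetical usage: one-sign-bit case (k_L = 1, k_U = n) for a two-faced die.
n = 3
one_bit = trinomial_sum(
    n, 0.6, 0.4,
    lam1=1, nu1=n,
    lam2=lambda k1: max(0, k1 - n),        # form of (6.42) with k_U = n
    nu2=lambda k1: min(k1 - 1, n - k1),    # form of (6.43) with k_L = 1
)
```

Passing the limits in, rather than computing them inside the function, keeps the sketch independent of how the quantization bounds of (6.5) are defined.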

Theorem 6.1 The differential learning strategy attempts to approximate the a posteriori class differentials
$\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ to one (sign) bit precision. The probability that differential
learning will achieve this goal, given a sample size $n$, is higher than the probability of success for any other
learning strategy by which the most likely die face $\omega_{(1)}$ might be ascertained.

Proof: As mentioned earlier, the necessary and sufficient condition for the most likely die face $\omega_{(1)}$
to be empirically evident after $n$ casts of the die is simply that the number of occurrences of the
most likely face be greater than the number of occurrences of any other face. Equivalently, the
discriminant differentials $\{\delta_{(1)}(X), \ldots, \delta_{(C)}(X)\}$ must approximate the a posteriori class differentials
$\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ to at least one (sign) bit precision. Thus, if we employ the
differential strategy of approximating the a posteriori class differentials, we end up choosing the empirically
most likely face as our estimate for the true most likely face $\omega_{(1)}$. Our probability of success using
this strategy (i.e., successfully approximating the a posteriori class differentials to one bit precision, thereby
identifying $\omega_{(1)}$ correctly) is equal to the probability of all possible cast sequences of length $n$ for which
$k_{(1)}$ is maximal. This, in turn, is equal to the sum of all legal expressions (that is, all expressions in which the
$k$s sum to $n$) of the multinomial probability mass function (pmf) given by (6.11) and (6.12). We re-express
the multinomial pmf here, using rank indices:

$$P\bigl(k_{(1)}, \ldots, k_{(C)}\bigr) = n! \prod_{i=1}^{C} \frac{P(\omega_{(i)})^{k_{(i)}}}{k_{(i)}!}\,; \qquad \sum_{i=1}^{C} k_{(i)} = n \qquad (6.44)$$

Since the multinomial pmf in (6.44) is non-negative, and since the sum of all the pmf terms satisfying the
constraints

$$k_{(1)} > k_{(j)} \;\; \forall j \ge 2\,; \qquad \sum_{i=1}^{C} k_{(i)} = n \qquad (6.45)$$

represents the cumulative probability of all cast sequences of length $n$ for which $\omega_{(1)}$ is empirically
evident, the cumulative probability that differential learning will meet its goal is the largest possible for any
strategy by which the most likely die face might be ascertained. That is, the differential strategy described in
theorem 6.1 has the greatest probability of success in achieving its goal. ■

Remark: Unfortunately, there is no compact way to express the cumulative multinomial probability
described in the preceding proof. The only way to compute the sum exactly is to evaluate each and
every possible combination of $k$s to determine if it satisfies the constraints of (6.45); if it does, then
the corresponding multinomial expression is added to the cumulative sum. As mentioned earlier, this
computation is intractable for all but very small values of $C$ and $n$. We resort to the approximations of
(6.27) and (6.39) in order to estimate the probabilities of (6.25) and (6.38) via a tractable computation (see
section 6.4). It is interesting to note that the logic of the preceding proof holds for the approximations
as well. Specifically, it should be clear from a comparison of (6.27) through (6.31) and (6.39) through (6.43) that
the probability $P\bigl(L_{M_{q\Delta}}[\Delta(\omega_{(1)})] < \delta_{(1)}(X) \le U_{M_{q\Delta}}[\Delta(\omega_{(1)})]\bigr)$ is largest for a given sample size $n$
when $M_{q\Delta} = 1$ (such that $L_{M_{q\Delta}}[\Delta(\omega_{(1)})] = 0$ and $k_{U_{M_{q\Delta}}[\Delta(\omega_{(1)})]} = n$). In simple terms, attempting to
estimate $\Delta(\omega_{(1)})$ (recall that this notation is short-hand for the top-ranked a posteriori class differential
$\Delta_{W|X}(\omega_{(1)} \mid X)$) with more than one bit precision constitutes a learning strategy with a lower probability
of success than the differential learning strategy. This is because the goal of estimating the a posteriori class
differentials with higher precision places tighter constraints on the values of $\{k_{(1)}, \ldots, k_{(C)}\}$ that satisfy
the more rigorous learning goal. Because there are some cast sequences of length $n$ that satisfy (6.45) but do
not satisfy the more stringent goal, there will be fewer terms in the multinomial sum for the higher-precision
strategy. Thus, the higher-precision strategy will have a lower probability of success.
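The exhaustive computation described in the remark can be sketched in a few lines for very small $C$ and $n$. The following Python is a minimal illustration, not the authors' implementation: the function names are invented here, and `probs[0]` is assumed to hold the probability of the top-ranked face $\omega_{(1)}$. The number of count vectors enumerated is $\binom{n + C - 1}{C - 1}$, which is why the approach does not scale.

```python
from math import factorial, prod


def compositions(n, parts):
    """Yield every tuple of `parts` non-negative integers summing to n."""
    if parts == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in compositions(n - first, parts - 1):
            yield (first,) + rest


def multinomial_pmf(counts, probs):
    """Multinomial pmf of (6.44) for one vector of face counts."""
    denom = prod(factorial(k) for k in counts)
    coeff = factorial(sum(counts)) // denom
    return coeff * prod(p ** k for p, k in zip(probs, counts))


def prob_top_face_evident(probs, n):
    """Sum the pmf over all count vectors that satisfy (6.45).

    probs[0] is assumed to be the probability of the most likely face.
    """
    total = 0.0
    for counts in compositions(n, len(probs)):
        if all(counts[0] > k for k in counts[1:]):
            total += multinomial_pmf(counts, probs)
    return total
```

For a two-faced die with probabilities 0.6 and 0.4 and $n = 3$, the only qualifying count vectors are (2, 1) and (3, 0), so the sum is $3 \cdot 0.6^2 \cdot 0.4 + 0.6^3 = 0.648$.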

Corollary 6.1 The differential strategy for learning the most likely die face requires the minimum differential
complexity necessary for the task.

Proof: The discriminant differential $\delta_{(1)}(X)$ need only approximate the a posteriori class differential
$\Delta_{W|X}(\omega_{(1)} \mid X)$ in sign in order for the most likely die face to be evident. That is, the necessary and
sufficient condition on $\delta_{(1)}(X)$ for correctly identifying the most likely die face is

$$\operatorname{sign}[\delta_{(1)}(X)] = \operatorname{sign}[\Delta_{W|X}(\omega_{(1)} \mid X)], \qquad (6.46)$$

which follows directly from the constraint $k_{(1)} > k_{(j)} \;\; \forall j \ge 2$ in (6.45). Equivalently,

$$q_{M_{q\Delta\,\min}}[\delta_{(1)}(X)] = q_{M_{q\Delta\,\min}}[\Delta_{W|X}(\omega_{(1)} \mid X)]\,; \qquad M_{q\Delta\,\min} = 1 \qquad (6.47)$$

■

Remark: Note that the condition of (6.46) is precisely that of (2.17).

Corollary 6.2 The differential learning goal of approximating $\Delta_{W|X}(\omega_{(1)} \mid X)$ to at least one (sign) bit
precision with confidence not less than $\alpha$ requires the smallest sample size $n_\Delta$ of any learning strategy by
which the most likely die face $\omega_{(1)}$ might be ascertained.

Proof: The proof follows immediately from the proof of theorem 6.1. ■

6.3.2  The  Probabilistic  Mechanism  by  which  the  Most  Likely  Die  Face  Becomes 
Empirically  Evident 

As described earlier, the probabilistic strategy for identifying the most likely die face involves estimating the
$C$ face probabilities $\{P_{W|X}(\omega_{(1)} \mid X), \ldots, P_{W|X}(\omega_{(C)} \mid X)\}$. In order for us to identify the most likely
face $\omega_{(1)}$ with probability not less than $\alpha$, we must distinguish $P_{W|X}(\omega_{(1)} \mid X)$ from $P_{W|X}(\omega_{(2)} \mid X)$
with probability not less than $\alpha$. This implies that we choose a quantization level $M_{qP}$ such that


$$
P\Bigl(L_{M_{qP}}[P_{W|X}(\omega_{(i)} \mid X)] < \hat{P}_{W|X}(\omega_{(i)} \mid X) \le U_{M_{qP}}[P_{W|X}(\omega_{(i)} \mid X)],\;\;
q_{M_{qP}}[\hat{P}_{W|X}(\omega_{(1)} \mid X)] > q_{M_{qP}}[\hat{P}_{W|X}(\omega_{(2)} \mid X)]\Bigr) \ge \alpha \qquad (6.48)
$$

The minimum value of $M_{qP}$ that satisfies (6.48) for asymptotically large $n$ (by making the quantized
difference $q_{M_{qP}}[P_{W|X}(\omega_{(1)} \mid X)] - q_{M_{qP}}[P_{W|X}(\omega_{(2)} \mid X)]$ implied by (6.48) exceed one least significant
bit) is


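$$
M_{qP\,\min} =
\begin{cases}
\underbrace{1}_{\text{sign}} + \underbrace{\bigl\lceil -\log_2[\Delta_{W|X}(\omega_{(1)} \mid X)] \bigr\rceil}_{\text{magnitude bits}}, & -\log_2[P_{W|X}(\omega_{(j)} \mid X)] \notin \mathbb{Z}^{+},\;\; j \in \{1, 2\} \\[2ex]
\underbrace{1}_{\text{sign}} + \underbrace{\bigl\lceil -\log_2[\Delta_{W|X}(\omega_{(1)} \mid X)] \bigr\rceil + 1}_{\text{magnitude bits}}, & \text{otherwise}
\end{cases}
\qquad (6.49)
$$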


Recall that $\mathbb{Z}^{+}$ represents the set of all positive integers. Note also that the conditional nature of
$M_{qP\,\min}$ in (6.49) prevents the case in which $\lim_{\epsilon \to 0} P_{W|X}(\omega_{(1)} \mid X) - \epsilon = L_{M_{qP}}[P_{W|X}(\omega_{(1)} \mid X)]$ or
$P_{W|X}(\omega_{(2)} \mid X) = U_{M_{qP}}[P_{W|X}(\omega_{(2)} \mid X)]$; either case would require an infinitely large sample size before
the variance of the corresponding estimate became small enough to distinguish $q_M[\hat{P}_{W|X}(\omega_{(1)} \mid X)]$ from
$q_M[\hat{P}_{W|X}(\omega_{(2)} \mid X)]$. The sign bit in (6.49) is not required to estimate the probabilities in (6.48), since all
probabilities are positive; it merely follows the conventions of section 6.2. Thus, the probabilistic complexity
of the $M_{qP}$-bit approximation is actually $M_{qP} - 1$ rather than $M_{qP}$.

Using $M_{qP} = M_{qP\,\min}$ from (6.49), (6.48) is given by


(Pw|x(^(()  |X))  <  P>v|x(^(«)  |X)  <  UM^fPjvjxCt^i,)  |X)], 
<7m^[Pvv|x(^(|)  |X)]  >  <7m,,(P»v|x(^(2)  |X)] 

<■(11  <•(<■1  c 


)■ 


'  Y.  -  Y.  s*'" 


=  n  (6.50) 


,=  i 


>  a , 


where 


|X)| 

(6.51) 

(6.52) 

Vj  Vy  <  2 

(6.53) 

when we require that each of the approximations of $P_{W|X}(\omega_{(i)} \mid X)$ fall within a single
quantization interval with probability not less than $\alpha$. This is the case if our goal is to estimate
$\{P_{W|X}(\omega_{(1)} \mid X), \ldots, P_{W|X}(\omega_{(C)} \mid X)\}$ with $M_{qP}$-bit precision.

Of course, if our goal is merely to distinguish between the quantized approximations $q_M[\hat{P}_{W|X}(\omega_{(1)} \mid X)]$
and $q_M[\hat{P}_{W|X}(\omega_{(2)} \mid X)]$ (the weakest form of probabilistic learning), we need only enforce a
lower bound on the approximation of $P_{W|X}(\omega_{(1)} \mid X)$ and an upper bound on the approximation of
$P_{W|X}(\omega_{(2)} \mid X)$. This leads to the following approximate formula (in short-hand notation), analogous to
those of (6.25) and (6.38):


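$$
\begin{aligned}
& P\bigl(q_M[\hat{P}(\omega_{(1)})] > q_M[\hat{P}(\omega_{(j)})],\;\; \forall j > 1\bigr) \\
&\quad \approx n! \sum_{k_{(1)} = \lambda_1}^{\nu_1} \frac{P(\omega_{(1)})^{k_{(1)}}}{k_{(1)}!} \sum_{k_{(2)} = \lambda_2}^{\nu_2} \frac{P(\omega_{(2)})^{k_{(2)}} \, \bigl(1 - P(\omega_{(1)}) - P(\omega_{(2)})\bigr)^{n - k_{(1)} - k_{(2)}}}{k_{(2)}! \, (n - k_{(1)} - k_{(2)})!} \qquad (6.54) \\
&\quad \ge \alpha
\end{aligned}
$$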


where 


A,  =  rnax^e-bl.^!^-^ -I-  1^  VC  >  2  (6.55) 

i'(  =  «  (6.56) 

A2  =  0  (6.57) 

t'2  =  min  (B.n  —  kd))  (6.58) 

B  =  ka^^jPicda))]  =  -  «  (6-59) 


The restriction of (6.55) stems from (6.33), since this condition is necessary (although, again, insufficient) to
ensure the validity of the approximation in (6.54).

Equation (6.59) illustrates that there is one and only one boundary $B$ separating our quantized estimates of
$P_{W|X}(\omega_{(1)} \mid X)$ and $P_{W|X}(\omega_{(2)} \mid X)$ for $M_{qP\,\min}$-bit quantization. If, however, we use $(M_{qP} > M_{qP\,\min})$-bit
quantization along with equations (6.55) through (6.58), there are many boundaries that can be used in (6.54), via
(6.55) and (6.58). Specifically, there is a simple recursion by which every possible boundary $B$ for $M_{qP}$-bit
quantization leads to itself and two additional boundaries for $(M_{qP} + 1)$-bit quantization:

$M_{qP}$-bit quantization        $(M_{qP} + 1)$-bit quantization

n 


B 


B  +  2-"^^  B  +  2-*V  <  L„,  +  ,  [P,v|x(Co», , )  i  X)] 
no  boundary’,  otherwise 


(6.60) 


8 


{ 


B  _  ,  e-  2-"»'  >  i[Pw|x(t^(2)  |X)1 

no  boundary  otherwise 


Indeed,  one  can  show  that  there  are 


I  [i,„,(P,v|x(W|ii|X)|  -U«,|Pw|x(W(!i|X)|]  ■  2'"^-"  +  1,  «,,  > 

|B».I  = 

0 ,  otherwise 

(6.61) 

members in the set of possible boundaries between $q_M[\hat{P}_{W|X}(\omega_{(1)} \mid X)]$ and $q_M[\hat{P}_{W|X}(\omega_{(2)} \mid X)]$ that can
be used for $B$ in equations (6.55) and (6.58).

Corollary 6.3 The probabilistic learning strategy attempts to approximate the a posteriori class probabilities
$\{P_{W|X}(\omega_{(1)} \mid X), \ldots, P_{W|X}(\omega_{(C)} \mid X)\}$ to a specified level of precision, measured by the probabilistic
complexity $M_{qP} - 1$ of the approximations. As a result, the strategy also attempts to approximate the
a posteriori class differentials $\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ to a specified level of precision,
measured by the differential complexity $M_{q\Delta} = M_{qP}$ of the approximations. Probabilistic learning therefore
requires higher functional complexity and has a lower probability of success than differential learning.

Proof: Since differential learning attempts to approximate the a posteriori class differentials to one sign
bit precision, the increased complexity requirements of probabilistic learning (as measured by its differential
complexity) are self evident. Whether we use the exact expression of (6.50) or the approximation of (6.54), it
is also clear that probabilistic learning places tighter constraints on the summation bounds for the cumulative
multinomial expression. Thus, by the arguments of the proof to theorem 6.1, the probability that probabilistic
learning will achieve its more precise goal of estimating the die face probabilities with $M_{qP}$ bits of precision is
less than the probability that differential learning will achieve its minimum-complexity goal of approximating
the a posteriori class differentials to at least one sign bit precision. ■


We use the notation $|B_{M_{qP}}|$ to denote the cardinality of (i.e., the number of elements or members in) the set $B_{M_{qP}}$.

We emphasize that the probability of success we discuss in theorem 6.1 and corollary 6.2 pertains
to the goal of approximating $\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ to $M_{q\Delta} = 1$ sign bit precision.
This is the same as the goal of identifying the most likely die face $\omega_{(1)}$. The probability of
success we discuss in corollary 6.3 pertains to a different, more complex goal: that of approximating
$\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ to $M_{q\Delta} > 1$ bit precision. Corollary 6.3 therefore does
not assert that probabilistic learning is less likely than differential learning to identify the most likely
die face unless the probabilistic complexity allocated for the learning process is inadequate. That is, if
$M_{qP} < M_{qP\,\min}$ in (6.49), there will be insufficient functional complexity to distinguish $P_{W|X}(\omega_{(1)} \mid X)$
from $P_{W|X}(\omega_{(2)} \mid X)$, and probabilistic learning will fail to identify the most likely die face $\omega_{(1)}$ for any
number of die casts. If, on the other hand, $M_{qP} \ge M_{qP\,\min}$, there will be sufficient functional complexity
to distinguish the a posteriori class probabilities and to approximate the a posteriori class differentials with
high precision. Thus, if the most likely die face is evident, probabilistic learning will identify it, with the
same probability of success that differential learning has, given sufficient functional complexity.

This fact makes sense intuitively: the $n$ casts of the die alone determine whether or not $\omega_{(1)}$ is
empirically evident. The learning strategy by which we estimate the identity of $\omega_{(1)}$ has no effect on the
observable statistics of the game; it affects only what we can infer from those statistics. The advantage of
differential learning therefore lies in its ability to identify the most likely die face as soon as it becomes
evident in the sequence of casts, while requiring the least differential complexity necessary to achieve this
goal. Indeed, theorem 6.1 and corollaries 6.1 and 6.2 are the information-theoretic equivalents of theorem 3.1
and corollary 3.1. The minimum-complexity requirement of differential learning is an advantage from the
standpoint of generalization since, under VC analysis [137, 136], excessive complexity is anathema.


6.3.3  Discriminant  Information  versus  Probabilistic  Information 

If we approximate each of the die face probabilities $\{P_{W|X}(\omega_{(1)} \mid X), \ldots, P_{W|X}(\omega_{(C)} \mid X)\}$ with
$M_{qP} = M - 1$ bits of probabilistic complexity,* then we approximate each of the a posteriori class differentials
$\{\Delta_{W|X}(\omega_{(1)} \mid X), \ldots, \Delta_{W|X}(\omega_{(C)} \mid X)\}$ with $M_{q\Delta} = M$ bits of differential complexity. This follows
immediately from (6.15) and (6.20), which allow us to express the differentials in terms of the probabilities
and vice-versa. The relationship between probabilistic and differential complexity allows us to make a
direct comparison between the functional complexity requirements of differential learning and probabilistic
learning.

*Recall that the sign bit is superfluous when the fixed-point binary representation described in section 6.2 is being
used to approximate a (non-negative) probability.

The Information-Theoretic Argument for Differential Learning

Occam’s razor [130, 21] stipulates that the complexity of an estimate’s representation need not and should
not exceed the information contained in the empirical data sample used to compute the estimate. This is
borne out by the cumulative probability expressions of sections 6.3.1 and 6.3.2: it is always more probable
that $n$ casts of the die contain $M_{q\Delta\,\min} = 1$ bit of discriminant information than that they contain
$M_{q\Delta} > 1$ bits of discriminant information. When $n$ is large, such that the a posteriori class differentials can
probably be estimated with $M_{q\Delta} \gg 1$-bit precision, the information content of the cast sequence justifies our
doing this. Equivalently, we are justified in estimating all the die face probabilities with high precision.
However, when $n$ is small (as it almost always is in real-world learning/pattern recognition tasks), the
information content of the cast sequence justifies no more than our estimating $\Delta_{W|X}(\omega_{(1)} \mid X)$ with one bit
of precision. Equivalently, we are justified only in estimating the identity of the most likely die face $\omega_{(1)}$.


6.4  Bounds on the Training Sample Size Requirements of the Differential and Probabilistic Learning Strategies

Equation (6.27) is an approximate expression of the probability that the discriminant differential $\delta_{(1)}(X)$
associated with the most likely die face $\omega_{(1)}$ will be positive following $n$ casts of the die $X$. Equation
(6.54) is an approximate expression of the probability that the estimate $\hat{P}_{W|X}(\omega_{(1)} \mid X)$ will be greater than
the threshold value $B$ and the estimate $\hat{P}_{W|X}(\omega_{(2)} \mid X)$ will be less than $B$ following $n$ casts of the die.
Thus (6.27) states the approximate probability that the goal of differential learning will be reached in $n$ casts
of the die; likewise, (6.54) states the approximate probability that the weakest goal of probabilistic learning
will be reached in the same $n$ casts.

These two equations can be evaluated numerically in order to estimate via iterative search the minimum
values of $n$ at which the differential ($n_\Delta$) and probabilistic ($n_P$) learning goals will be reached with
specified probability, given particular values of $P_{W|X}(\omega_{(1)} \mid X)$, $P_{W|X}(\omega_{(2)} \mid X)$, and $C$. The numerically
estimated values of $n_\Delta$ and $n_P$ are generally quite close to the empirical estimates obtained via Monte
Carlo simulations, the one exception being when $C$ is large and $P_{W|X}(\omega_{(2)} \mid X)$ is small. As mentioned in
section 6.3.1, assumption 6.1 (on which the approximations of (6.27) and (6.54) are predicated) fails to
hold under these circumstances, so $n_\Delta$ and $n_P$ tend to be under-estimated.
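A Monte Carlo estimate of $n_\Delta$ of the kind mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the simulation used in the report: the function names, the trial count, the search cap `n_max`, and the fixed random seed are all choices made here, and `probs[0]` is assumed to be the probability of the most likely face.

```python
import random


def empirical_success_rate(probs, n, trials, rng):
    """Fraction of trials in which face 0 turns up strictly most often."""
    faces = list(range(len(probs)))
    wins = 0
    for _ in range(trials):
        counts = [0] * len(probs)
        for face in rng.choices(faces, weights=probs, k=n):
            counts[face] += 1
        if all(counts[0] > c for c in counts[1:]):
            wins += 1
    return wins / trials


def smallest_n(probs, alpha, trials=2000, n_max=500, seed=0):
    """Return the first n whose empirical success rate reaches alpha.

    Returns None if no n up to n_max qualifies.
    """
    rng = random.Random(seed)
    for n in range(1, n_max + 1):
        if empirical_success_rate(probs, n, trials, rng) >= alpha:
            return n
    return None
```

The success rate need not increase monotonically with $n$ (ties are possible for some $n$ and not others), so a sturdier search might require the threshold to hold over several consecutive values of $n$ before stopping.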

Since we are looking for a greatest lower bound on $n$, above which each learning strategy is guaranteed
to achieve its goal, we would prefer estimators of $n_\Delta$ and $n_P$ that are positively biased, rather than negatively
biased. Moreover, the iterative search required to estimate $n_\Delta$ and $n_P$ numerically is computationally
intensive. This motivates us to derive greatest lower bounds on $n$ using classical techniques.

6.4.1  A Greatest Lower Bound on $n_\Delta$

For the differential learning strategy, we want to know the value of $n_\Delta$ above which the discriminant
differential $\delta_{(1)}(X)$ associated with the most likely die face $\omega_{(1)}$ is non-positive with probability less
than $d = 1 - \alpha$. This is a “one-sided tail” probability, which can be bounded from above by a two-sided
tail probability. Using short-hand notation for $P_{W|X}(\omega_{(i)} \mid X)$, $\hat{P}_{W|X}(\omega_{(i)} \mid X)$, and $\Delta_{W|X}(\omega_{(1)} \mid X)$,
the bounding inequalities are


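$$
\begin{aligned}
P\bigl(\delta_{(1)}(X) \le 0\bigr) &\le P\bigl(|\delta_{(1)}(X) - \Delta(\omega_{(1)})| \ge \Delta(\omega_{(1)})\bigr) \qquad &(6.62) \\
&\le \frac{\operatorname{Var}[\delta_{(1)}(X)]}{\Delta(\omega_{(1)})^2} \qquad &(6.63)
\end{aligned}
$$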


The  inequality  in  (6.62)  represents  the  two-sided  upper  bound  on  the  one-sided  probability,  and  the  inequality 
in  (6.63)  is  an  application  of  Chebyshev’s  inequality  (e.g.,  [33,  pg.  219]). 

Since $\Delta(\omega_{(1)}) = P(\omega_{(1)}) - P(\omega_{(2)})$, and we operate under assumption 6.1, we assume $\delta_{(1)}(X) =
\hat{P}(\omega_{(1)}) - \hat{P}(\omega_{(2)})$. Although the collective empirical face probabilities of the die are multinomially
distributed, individual face probabilities are binomially distributed (i.e., in any given cast, the face $\omega_{(i)}$
turns up with probability $P(\omega_{(i)})$ and fails to turn up with probability $P(\neg\omega_{(i)}) = 1 - P(\omega_{(i)})$). Thus,
the variance of $\hat{P}(\omega_{(i)})$ is given by

$$\operatorname{Var}[\hat{P}(\omega_{(i)})] = \frac{P(\omega_{(i)}) \cdot (1 - P(\omega_{(i)}))}{n} \qquad (6.64)$$

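We make the invalid but simplifying assumption that $\hat{P}(\omega_{(1)})$ and $\hat{P}(\omega_{(2)})$ are independent. This allows us
to express the variance of $\delta_{(1)}(X)$ by

$$
\begin{aligned}
\operatorname{Var}[\delta_{(1)}(X)] &\cong \operatorname{Var}[\hat{P}(\omega_{(1)})] + \operatorname{Var}[\hat{P}(\omega_{(2)})] \\
&= \frac{P(\omega_{(1)}) \cdot (1 - P(\omega_{(1)})) + P(\omega_{(2)}) \cdot (1 - P(\omega_{(2)}))}{n} \qquad (6.65)
\end{aligned}
$$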
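To see how much the independence assumption matters, the variance of (6.65) can be compared with the variance that follows from the multinomial covariance $\operatorname{Cov}[\hat{P}(\omega_{(1)}), \hat{P}(\omega_{(2)})] = -P(\omega_{(1)}) P(\omega_{(2)}) / n$, a standard property of multinomial counts. The sketch below is illustrative; the function names are assumptions introduced here.

```python
def var_delta_independent(p1, p2, n):
    """Variance of the difference of the two estimates under (6.65),
    i.e. treating the two empirical face probabilities as independent."""
    return (p1 * (1 - p1) + p2 * (1 - p2)) / n


def var_delta_multinomial(p1, p2, n):
    """Variance of the same difference when the negative multinomial
    covariance -p1*p2/n between the two estimates is included."""
    return (p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n


# Hypothetical example values: P(w1) = 0.5, P(w2) = 0.3, n = 100 casts.
approx = var_delta_independent(0.5, 0.3, 100)
exact = var_delta_multinomial(0.5, 0.3, 100)
```

Because the covariance is negative, the independent-estimates form is never larger than the multinomial form; for these example values it is 0.0046 against 0.0076.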

Thus, by (6.62), (6.63), and (6.65), the probability that the discriminant differential will be non-positive (i.e.,
that the most likely die face will not be evident) after $n$ casts of the die is bounded from above as follows:

$$P\bigl(\delta_{(1)}(X) \le 0\bigr) \le \frac{P(\omega_{(1)}) \cdot (1 - P(\omega_{(1)})) + P(\omega_{(2)}) \cdot (1 - P(\omega_{(2)}))}{n \cdot (P(\omega_{(1)}) - P(\omega_{(2)}))^2} \qquad (6.66)$$

If we wish this probability to be no more than $d = 1 - \alpha$, the number of casts $n_\Delta$ must be bounded from
below as follows:

$$n_\Delta \ge \frac{P(\omega_{(1)}) \cdot (1 - P(\omega_{(1)})) + P(\omega_{(2)}) \cdot (1 - P(\omega_{(2)}))}{d \cdot (P(\omega_{(1)}) - P(\omega_{(2)}))^2} \qquad (6.67)$$


It is straightforward to show that the condition of (6.67) is equivalent to requiring that one standard
deviation of the discriminant differential $\delta_{(1)}(X)$ not exceed the value $\sqrt{d} \cdot \Delta(\omega_{(1)})$. Thus, if we want
the most likely die face to be evident with probability at least $\alpha = 1 - d = .95$, one standard deviation of
$\delta_{(1)}(X)$ must not exceed $.224 \cdot \Delta(\omega_{(1)})$. Equivalently,

$${\sim}n_\Delta[P(\omega_{(1)}), P(\omega_{(2)}), d] \ge 20 \cdot \frac{P(\omega_{(1)}) \cdot (1 - P(\omega_{(1)})) + P(\omega_{(2)}) \cdot (1 - P(\omega_{(2)}))}{(P(\omega_{(1)}) - P(\omega_{(2)}))^2} \qquad (6.68)$$


Through Monte Carlo simulations, we have found that the most likely die face is evident with an empirical
probability of at least .95 if one standard deviation of $\delta_{(1)}(X)$ does not exceed $\tfrac{1}{3} \Delta(\omega_{(1)})$. That is, the
factor of 20 in (6.68) can, in practice, be reduced to 9. Appendix J tabulates ${\sim}n_\Delta[P(\omega_{(1)}), P(\omega_{(2)}), d = .05]$,
given a factor of 9, and compares this bound with empirical values of $n_\Delta$ (generated via Monte Carlo
simulations) above which the most likely die face is evident in 95% of the trials.
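The bound of (6.67) and (6.68) is simple enough to evaluate directly. The sketch below is illustrative: the function name `n_delta_bound` and the optional `factor` argument (which stands in for the multiplier $1/d$, or for the empirically reduced value of 9) are assumptions introduced here.

```python
def n_delta_bound(p1, p2, d=0.05, factor=None):
    """Chebyshev-style lower bound on the number of casts, as in (6.67).

    p1, p2 : probabilities of the two top-ranked faces (p1 > p2)
    d      : allowed probability that the top face is not evident
    factor : multiplier on the variance ratio; defaults to 1/d as in (6.68)
    """
    if factor is None:
        factor = 1.0 / d
    numerator = p1 * (1 - p1) + p2 * (1 - p2)
    return factor * numerator / (p1 - p2) ** 2


# Hypothetical example: P(w1) = 0.5, P(w2) = 0.3.
chebyshev = n_delta_bound(0.5, 0.3, d=0.05)   # multiplier 1/d = 20
reduced = n_delta_bound(0.5, 0.3, factor=9)   # empirically reduced multiplier
```

For these example values the variance ratio is $0.46 / 0.04 = 11.5$, so the two bounds come to roughly 230 and 103.5 casts respectively.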


6.4.2  A Greatest Lower Bound on $n_P$

A greatest lower bound on the sample size $n_P$ necessary to guarantee the more rigorous goal of probabilistic
learning is derived in a similar manner. Of course, probabilistic learning implies tighter constraints on
the variance of all the estimated face probabilities, not just $\hat{P}(\omega_{(1)})$ and $\hat{P}(\omega_{(2)})$. As a result, ${\sim}n_P$ is
substantially larger than ${\sim}n_\Delta$, reflecting the greater information requirements of probabilistic learning.

Let us consider the probability of a single die face $\omega_{(i)}$ turning up on a given cast, versus the probability
of any other die face turning up. As mentioned earlier, the estimated probability of this event $\hat{P}(\omega_{(i)})$ is
binomially distributed when considered in this manner, because the $C$-sided die reduces to an unfair coin
with face probabilities $P(\omega_{(i)})$ and $P(\neg\omega_{(i)})$. In this context, we wish to know the upper bound on the
probability that the estimate $\hat{P}(\omega_{(i)})$ will deviate from the true probability $P(\omega_{(i)})$ by an amount not less
than $\epsilon \cdot P(\omega_{(i)})$. Using Chebyshev’s inequality once more, the bound is given by


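$$
\begin{aligned}
P\bigl(|\hat{P}(\omega_{(i)}) - P(\omega_{(i)})| \ge \epsilon \cdot P(\omega_{(i)})\bigr) &\le \frac{\operatorname{Var}[\hat{P}(\omega_{(i)})]}{(\epsilon \cdot P(\omega_{(i)}))^2} \\
&= \frac{1 - P(\omega_{(i)})}{n \cdot \epsilon^2 \cdot P(\omega_{(i)})} \qquad (6.69)
\end{aligned}
$$

Thus, the greatest lower bound ${\sim}n_P$ on the number of die casts necessary to ensure with probability at least
$\alpha = 1 - d$ that $\hat{P}(\omega_{(i)})$ does not deviate from $P(\omega_{(i)})$ by more than $\epsilon \cdot P(\omega_{(i)})$ is

$${\sim}n_P[P(\omega_{(i)}), \epsilon, d] \ge \frac{1 - P(\omega_{(i)})}{d \cdot \epsilon^2 \cdot P(\omega_{(i)})} \qquad (6.70)$$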


A comparison of (6.70) and (6.67) shows that the probabilistic bound is more than $1/\epsilon^2$ times the differential
bound, i.e.,

$${\sim}n_P[P(\omega_{(i)}), \epsilon, d] > \frac{1}{\epsilon^2} \cdot {\sim}n_\Delta[P(\omega_{(1)}), P(\omega_{(2)}), d] \qquad (6.71)$$

unless $\Delta(\omega_{(1)})$ is small in (6.67). Therefore, if we wish $\hat{P}(\omega_{(i)})$ to be within, say, five percent of
its true value, the number of die casts necessary to meet this goal will usually be at least 400 times the
number of casts needed merely to identify the most likely die face. If $P(\omega_{(i)})$ is appreciably smaller than
$P(\omega_{(1)})$ and $P(\omega_{(2)})$, the disparity between ${\sim}n_P[P(\omega_{(i)}), \epsilon, d]$ and ${\sim}n_\Delta[P(\omega_{(1)}), P(\omega_{(2)}), d]$ is even
greater than indicated by (6.71). Thus, the bounds on $n_\Delta$ and $n_P$ in (6.67) and (6.70) quantify the assertions
of theorem 6.1 and its corollaries.
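The two bounds can be compared numerically for any particular die. The sketch below is illustrative only: the function names and the example probabilities are assumptions introduced here, and the ratio it reports depends on which face probability is being estimated as well as on the gap between the two top-ranked faces.

```python
def n_delta_bound(p1, p2, d):
    """Lower bound of (6.67) on the casts needed to identify the top face."""
    return (p1 * (1 - p1) + p2 * (1 - p2)) / (d * (p1 - p2) ** 2)


def n_p_bound(p, eps, d):
    """Lower bound of (6.70) on the casts needed to estimate a face
    probability p to within a relative error eps."""
    return (1 - p) / (d * eps ** 2 * p)


# Hypothetical die: P(w1) = 0.9, P(w2) = 0.05, and a third face with
# probability 0.05 whose probability is to be estimated within 5 percent.
d, eps = 0.05, 0.05
differential = n_delta_bound(0.9, 0.05, d)
probabilistic = n_p_bound(0.05, eps, d)
ratio = probabilistic / differential
```

With these example values the ratio is far above $1/\epsilon^2 = 400$, which is the situation described above for a face whose probability is appreciably smaller than that of the most likely face.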

By  (6.71),  the  training  sample  sizes  of  (6.67),  necessary  to  guarantee  a  specified  level  of  generalization 
via  differential  learning,  are  typically  orders  of  magnitude  smaller  than  those  of  (6.70),  necessary  to  estimate 
probabilities  with  a  specified  level  of  precision.  This  indicates  that  current  probabilistic  extensions  of  the  PAC 
learning  paradigm  [133]  to  stochastic  concepts  on  uncountable  feature  vector  domains  (e.g.,  [59,  60,  146]) 
are  likely  to  over-estimate  the  training  sample  sizes  necessary  for  good  generalization  when  the  learning 
objective  is  merely  pattern  classification. 

6.5  Extending the Rigged Die Paradigm to the General C-class Learning/Pattern Recognition Task

It is straightforward to extend this chapter's information-theoretic paradigm from a single die to both
countable and uncountable feature vector spaces. The extension to the countable feature vector space is quite
simple, following immediately from the realization that a single die represents a single point on the countable
feature vector space $\mathcal{X}$. Thus, we move from a paradigm in which a single rigged dice game is played to
one in which a countably finite or infinite number of games are played. The choice of die to be cast is itself
modeled as an unfair die with $P$ sides, corresponding to the cardinality $|\mathcal{X}|$ (where $P$ may be infinitely
large). The probabilities associated with each of the $P$ faces reflect the probability mass function of the
feature vector $X$. From this point, the analysis is essentially the same as that for the single dice game.
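The two-stage game described above (first choose a die according to the feature-vector pmf, then cast it) can be simulated in a few lines. This is a minimal sketch with invented names and data structures: `feature_pmf` maps each point of a small, finite feature space to its probability, and `class_probs` maps each point to the face probabilities of its die.

```python
import random
from collections import Counter, defaultdict


def simulate_games(feature_pmf, class_probs, n, seed=0):
    """Play n two-stage casts and return the empirically most likely
    class at every feature-space point that was visited at least once."""
    rng = random.Random(seed)
    points = list(feature_pmf)
    weights = [feature_pmf[x] for x in points]
    counts = defaultdict(Counter)
    for x in rng.choices(points, weights=weights, k=n):
        # Cast the die that belongs to the selected feature-space point.
        faces = list(range(len(class_probs[x])))
        face = rng.choices(faces, weights=class_probs[x], k=1)[0]
        counts[x][face] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


# Hypothetical two-point feature space with one three-faced die per point.
pmf = {"a": 0.7, "b": 0.3}
dice = {"a": [0.6, 0.3, 0.1], "b": [0.2, 0.7, 0.1]}
winners = simulate_games(pmf, dice, n=2000)
```

Each point keeps its own tally, so nothing learned at one point is shared with another; points with small probability mass receive few casts and are the last to have their most likely class become evident.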

The extension to the uncountable feature vector space follows along the same lines as that for the countable
space. We partition the uncountable feature vector space into $P$ disjoint “resolution cells” $\mathcal{X}_1, \ldots, \mathcal{X}_P$
such that

$$\bigcup_{p=1}^{P} \mathcal{X}_p = \mathcal{X} \qquad (6.72)$$

We associate some nominal pattern (i.e., value of the feature vector) $X_p$ with each $\mathcal{X}_p$, and view the a
posteriori class probability $P_{W|X}(\omega_i \mid X_p)$ as the expectation

$$P_{W|X}(\omega_i \mid X_p) = \int_{\mathcal{X}_p} P_{W|X}(\omega_i \mid X) \cdot \rho_X(X) \, dX \qquad (6.73)$$

Through this artifice, the uncountable space looks just like the countable one, and the analysis follows
naturally. As the number of die casts grows large, we simply allow the number of resolution cells to grow
until

$$\lim_{P \to \infty} P_{W|X}(\omega_i \mid X_p) = P_{W|X}(\omega_i \mid X), \qquad X \in \mathcal{X}_p \qquad (6.74)$$

...precisely the mechanism we employ in the derivations for probabilistic and differential learning in
chapter 2. Thus, the mean values $E_X\bigl[{\sim}n_P[P(\omega_{(i)}), \epsilon, d]\bigr]$ and $E_X\bigl[{\sim}n_\Delta[P(\omega_{(1)}), P(\omega_{(2)}), d]\bigr]$ can be
derived in order to determine the expected number of examples of $X$ needed to learn the most likely class
$\omega_{*} = \omega_{(1)}$ for each pattern in feature vector space. Likewise, $E_X[M_{qP\,\min}]$ can be derived in order to
determine the average minimum functional complexity necessary for probabilistic learning.

Our objective in describing the procedure by which the die paradigm is extended to the general feature
vector space is not so much to do actual modeling or sample complexity computations (see [8] and [146]
for lovely, probabilistically motivated sample complexity analyses along these lines) as it is to point out
that theorem 6.1 and its corollaries hold for the general feature vector space as well. There is at least one
important restriction, however. The information-theoretic analysis of this chapter operates under an agnostic
assumption. In terms of dice, the assumption holds that information regarding one die conveys nothing
about any other die. In terms of feature vector space, the assumption holds that information regarding the
probabilistic nature of the feature vector at one point on $\mathcal{X}$ conveys nothing about the probabilistic
nature of the feature vector at any other point on $\mathcal{X}$. Clearly, feature vectors for which a proper parametric
model exists violate the agnostic assumption, in that information regarding the probabilistic nature of the
feature vector at one point in $\mathcal{X}$ conveys information about the probabilistic nature of the feature vector
at all points in $\mathcal{X}$. Under these gnostic conditions, the information-theoretic predictions of the sample sizes
necessary to characterize the feature vector (either probabilistically or differentially) will be pessimistic (i.e.,
excessive, perhaps by orders of magnitude). Moreover, corollary 6.2, which asserts that differential learning
requires the smallest number of die casts to determine the most likely face, will not always generalize
to the statement that differential learning requires the smallest sample size necessary to yield Bayesian
discrimination. We remind the reader that the gnostic condition (i.e., the case in which the proper parametric
model of the feature vector exists) and its relationship to differential and probabilistic learning strategies are
treated extensively in chapters 3 and 4.

The  minimum-complexity  requirements  of  differential  learning  do  not  depend  on  the  existence  of  a 
proper parametric model, but hold in all cases. This trait ensures that learning can be done with the simplest
model  possible,  which  in  turn  ensures  that  the  model  will  generalize  well,  independent  of  the  feature  vector’s 
probabilistic  nature  (section  3.5). 

6.6  Summary 

The  rigged  game  of  dice  lies  at  the  heart  of  all  statistical  pattern  recognition  tasks.  By  analysing  the 
requirements  for  identifying  the  most  likely  face  of  the  unfair  die,  we  derive  information-theoretic  proofs  that 
correspond to the estimation and set-theoretic proofs of chapter 3. Those proofs establish the asymptotic
efficiency of differential learning, as well as its minimal classifier complexity requirements. The proof that
differential learning requires the fewest casts of the die to identify the most likely die face does not extend to
a blanket assertion that differential learning is efficient for both small and large training sample sizes, since
probabilistic learning can be more efficient for small training samples when paired with a proper parametric
model (recall section Z.C). However, the proof does confirm the assertion that differential learning is efficient
for small as well as large training sample sizes when the hypothesis class is an improper parametric model.

Chapter 7

Implementing  Differential  Learning 


Outline 

We  describe  the  pragmatic  issues  that  arise  when  one  implements  differential  learning  via  the  CFM  objective 
function.  Specifically,  we  discuss  regulation  of  the  CFM  confidence  parameter,  the  role  of  CFM  and 
confidence  in  accepting  or  rejecting  classifications,  issues  of  representation,  and  discriminator  complexity. 
We  use  the  Iris  data  collected  by  E.  Anderson,  and  subsequently  used  by  R.  A.  Fisher  in  his  celebrated  paper 
on  linear  discriminants  [34]  to  illustrate  the  importance  of  these  issues  and  to  describe  practical  means  of 
addressing  them. 

7.1  Introduction 

Part  I  describes  the  theory  of  differential  learning,  but  it  does  not  discuss  the  details  of  implementing  the 
theory.  The  two  bodies  of  knowledge  are  linked,  but  there  is  a  point  at  which  scientific  rigor  inevitably  gives 
way  to  practical  considerations.  This  chapter  discusses  such  considerations,  and  serves  as  a  link  between  the 
theory  of  part  I  and  the  applications  of  that  theory  in  the  chapters  that  follow. 

We  describe  three  hypothesis  classes  that  we  use  in  the  experiments  of  this  chapter  and  all  that  follow. 
We  then  describe  the  Iris  data  collected  by  E.  Anderson  [3]  and  subsequently  used  by  R.  A.  Fisher  in  his 
1936 paper on linear discriminants [34]. We show that a linear classifier can learn all but two of the 150 Iris
examples.  We  then  use  this  learning  task  to  illustrate  the  following  practical  considerations  of  differential 
learning: 

• How the confidence parameter $\psi$ affects learning, and how it can be regulated during learning to
control the level of detail to be learned from the feature vector X.

• How $\psi$ is practically related to subsequent acceptance or rejection of test example classifications.

• How one’s representational choice (i.e., one’s a priori choice of hypothesis class) affects differential
learning.

•  The  relationship  between  low  classifier  complexity  and  efficient  learning. 

Throughout  this  chapter  and  most  of  those  that  follow,  we  contrast  differential  learning  with  probabilistic 
learning  under  controlled  experimental  conditions. 

7.2  Three  Hypothesis  Classes 

In  this  and  the  remaining  chapters,  we  employ  classifiers  drawn  from  three  hypothesis  classes,  corresponding 
to  three  functional  bases.  Different  functional  bases  yield  different  representations  or  models  of  the  data.  We 
describe  these  hypothesis  classes  in  the  following  three  sections. 

7.2.1  The  Linear  Hypothesis  Class 

The ith discriminant function of a discriminator belonging to the linear hypothesis class is given by

$$g_i(X \mid \theta) = X'^{\mathsf{T}} \theta_i, \tag{7.1}$$

where the notation $z^{\mathsf{T}}$ denotes the transpose of vector $z$, $X$ is the N-dimensional feature vector, and $X'$ is
the (N + 1)-dimensional augmented feature vector formed by prepending a single element of unit value to
$X$ (e.g., [29, pp. 136-137]):

$$X' \triangleq \begin{bmatrix} 1 \\ X \end{bmatrix}. \tag{7.2}$$

The parameter vector for the ith discriminant function is part of the over-all parameter vector for the
discriminator: $\theta_i \subset \theta$, with $\theta_i \in \mathbb{R}^{N+1}$. We refer to the classifier with such a discriminator as a linear
classifier because it forms piece-wise linear class boundaries on $\mathcal{X}$.
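
As an illustration of (7.1) and (7.2), the following minimal sketch evaluates the linear discriminants for one feature vector. It assumes the usual decision rule in which the classifier chooses the class with the largest discriminant output; the function names and the parameter values are hypothetical and chosen only for the example.

```python
from typing import List, Sequence


def augment(x: Sequence[float]) -> List[float]:
    """Form X' by prepending a single unit element to X, as in (7.2)."""
    return [1.0] + list(x)


def linear_discriminants(x: Sequence[float],
                         theta: Sequence[Sequence[float]]) -> List[float]:
    """Evaluate g_i(X | theta) = X'^T theta_i for every class i, as in (7.1)."""
    x_aug = augment(x)
    return [sum(a * w for a, w in zip(x_aug, theta_i)) for theta_i in theta]


def classify(x: Sequence[float], theta: Sequence[Sequence[float]]) -> int:
    """Assumed decision rule: pick the index of the largest discriminant."""
    g = linear_discriminants(x, theta)
    return max(range(len(g)), key=g.__getitem__)


# Hypothetical parameters: C = 3 classes, N = 2 features, so each theta_i
# has N + 1 = 3 elements and the discriminator has 3 * 3 = 9 parameters.
theta = [
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.5, 0.0, 0.0],
]
x = [2.0, 1.0]

outputs = linear_discriminants(x, theta)
label = classify(x, theta)
n_params = sum(len(theta_i) for theta_i in theta)
```

The same count applied to the Iris task (C = 3, N = 4) gives 3 · 5 = 15 parameters.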


7.2.2  The  Logistic  Linear  Hypothesis  Class 

The ith discriminant function of a discriminator belonging to the logistic linear hypothesis class is given by

$$g_i(X \mid \theta) = \underbrace{\left[ 1 + \exp\left( -X'^{\mathsf{T}} \theta_i \right) \right]^{-1}}_{\text{logistic function of } X'^{\mathsf{T}} \theta_i}, \tag{7.3}$$

where $X$, $X'$, and $\theta_i$ are as described above. We refer to the classifier with such a discriminator as a
logistic linear classifier because it employs logistic discriminant functions, yet it forms piece-wise linear
class boundaries on $\mathcal{X}$. This latter characteristic is clear in the solution of the $\omega_i$/$\omega_j$ boundary equation,
given the ith and jth discriminant functions; the solution is linear in $X$:

$$g_i(X \mid \theta) = g_j(X \mid \theta) \quad \text{iff} \quad X'^{\mathsf{T}} \left[ \theta_i - \theta_j \right] = 0. \tag{7.4}$$

Note that when the discriminator is formed by cascaded layers of logistic functions it constitutes a multi-layer
perceptron (e.g., [120]), and the resulting class boundaries it forms on $\mathcal{X}$ are non-linear in $X$.
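
The step from (7.3) to (7.4) can be spelled out in a few lines. The derivation below is a sketch that relies only on the logistic function being strictly monotonic in its argument; it introduces no symbols beyond those of (7.1) through (7.4).

```latex
\begin{align*}
g_i(X \mid \theta) = g_j(X \mid \theta)
  &\iff \left[ 1 + \exp\left( -X'^{\mathsf{T}} \theta_i \right) \right]^{-1}
      = \left[ 1 + \exp\left( -X'^{\mathsf{T}} \theta_j \right) \right]^{-1} \\
  &\iff \exp\left( -X'^{\mathsf{T}} \theta_i \right)
      = \exp\left( -X'^{\mathsf{T}} \theta_j \right) \\
  &\iff X'^{\mathsf{T}} \theta_i = X'^{\mathsf{T}} \theta_j \\
  &\iff X'^{\mathsf{T}} \left[ \theta_i - \theta_j \right] = 0 .
\end{align*}
```

The last line is linear in the elements of $X$, so the boundary is a hyperplane, exactly as for the linear hypothesis class.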

The Kullback-Leibler-generated logistic linear classifier: When generated probabilistically
via the CE objective function, the logistic linear classifier is identically the logistic
discriminant analysis model (i.e., the logistic regression model used for classification).
See appendix F for proofs of this assertion, which originate with White and Hjort.


7.2.3  The  Gaussian  Radial  Basis  Hypothesis  Class 

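The ith discriminant function of a discriminator belonging to the Gaussian Radial Basis hypothesis class is
given by

$$g_i(X \mid \theta) = \frac{1}{(2\pi)^{N/2} \, |\Sigma_i|^{1/2}} \exp\left[ -\frac{1}{2} (X - \mu_i)^{\mathsf{T}} \, \Sigma_i^{-1} \, (X - \mu_i) \right], \tag{7.5}$$

where N denotes the dimensionality of the feature vector, and the ith mean $\mu_i$ and covariance matrix $\Sigma_i$ are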
subsets  of  the  over-all  discriminator  parameter  vector  0 .  We  refer  to  the  classifier  with  such  a  discriminator 
as  an  RBF  classifier,  recognizing  that  it  constitutes  a  Gaussian  radial  basis  function  (RBF)  classifier  (e.g., 
[18,  95,  104,  92])  when  the  discriminator  is  formed  by  cascaded  layers  of  these  functions.  We  also  use  a 
modified  form  of  RBF  classifier  described  in  appendix  K;  we  refer  to  the  differentially-generated  variants  as 
Differential  Radial  Basis  Function  (DRBF)  classifiers. 
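
The following minimal sketch evaluates a single Gaussian discriminant function. It assumes that (7.5) is the usual normalized multivariate Gaussian density, and it further restricts the covariance matrix to be diagonal so that no matrix library is needed; both the restriction and the function name are assumptions made for illustration.

```python
import math
from typing import Sequence


def gaussian_discriminant(x: Sequence[float],
                          mean: Sequence[float],
                          variances: Sequence[float]) -> float:
    """Gaussian discriminant for a diagonal covariance diag(variances).

    With a diagonal covariance the determinant is the product of the
    variances and the quadratic form is a variance-weighted sum of squares.
    """
    n = len(x)
    quad = sum((xk - mk) ** 2 / vk for xk, mk, vk in zip(x, mean, variances))
    norm = (2.0 * math.pi) ** (n / 2.0) * math.sqrt(math.prod(variances))
    return math.exp(-0.5 * quad) / norm


# Hypothetical two-dimensional example with unit variances.
at_mode = gaussian_discriminant([1.0, 2.0], [1.0, 2.0], [1.0, 1.0])
in_tail = gaussian_discriminant([3.0, 2.0], [1.0, 2.0], [1.0, 1.0])
```

The discriminant is largest at the mean and decays with distance from it, which is the behaviour the later discussion of modes and tails relies on.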


7.3 Learning to Identify the Irises of the Gaspe Peninsula¹

Fisher’s  Iris  data  is  a  well-known  database  consisting  of  four  physical  measurements  (petal  length  and 
width  and  sepal  length  and  width)  taken  from  150  Iris  specimens,  50  examples  from  each  of  three  species. 
E.  Anderson  collected  the  data  [3],  and  R.  A.  Fisher  subsequently  used  it  in  his  seminal  paper  on  linear 
discriminants  [34].  We,  in  turn,  use  the  data  to  illustrate  the  pragmatic  issues  of  differential  learning. 

Figure 7.1 shows the Iris data projected onto two of the four dimensions of feature vector space $\mathcal{X}$. The
figure is based on figure 6.11 of [29]. The reader will note differences in the data locations between figure 7.1
and its progenitor because Duda and Hart depicted multiple examples having the same petal length
and width by perturbing the data points in their diagram, a procedure that we omit.

¹ We thank Professor Casimir Kulikowski of Rutgers University for providing us with an electronic version of Anderson/Fisher’s
original Iris data.

The empirical class distributions have a strong correlation with the histograms Fisher generated from his
linear discriminant function (cf. figure 1, [34]). Iris setosa is easily distinguishable from the other two species
(we will use the terms “class” and “species” synonymously throughout the remainder of this chapter).
Iris versicolor ($\omega_2$) and Iris virginica ($\omega_3$) have empirical distributions that overlap to some degree, as
projected onto this two-dimensional sub-space of $\mathcal{X}$. The region of overlap is depicted by the blue-to-red
color bar underneath the examples. Boundary $\mathcal{B}_{2,3}$ separates the empirical distributions of $\omega_2$ and $\omega_3$
with a relatively small number of errors. The color bar underneath is a means of encoding the position of
the superimposed examples relative to boundary $\mathcal{B}_{2,3}$ on the 2-dimensional (petal length and petal width)
sub-space of $\mathcal{X}$: examples in the dark red region are well into the $\omega_3$ side of the boundary; examples in the
dark blue region are well into the $\omega_2$ side of the boundary.

In reality, $\mathcal{B}_{2,3}$ is the 1-dimensional projection of a hypothetical 3-dimensional boundary in $\mathcal{X}$.
The color bar is the graphical means by which we transform the two-dimensional sub-space of figure 7.1 into
a single dimension perpendicular to the projection of $\mathcal{B}_{2,3}$. We encode position along this real dimension by
color and intensity; examples superimposed on increasingly red portions of the color bar have increasingly
positive values (i.e., they are more to the right of $\mathcal{B}_{2,3}$); those on increasingly blue portions of the color bar
have increasingly negative values (i.e., they are more to the left of $\mathcal{B}_{2,3}$). Figure 7.2 shows all the confusable
examples (i.e., all those in the vicinity of $\mathcal{B}_{2,3}$ in figure 7.1 and, as a result, superimposed on the color bar) in
the disjoint 2-dimensional sub-space of $\mathcal{X}$ comprising the sepal length and width features. The true class of
each example in figure 7.2 is indicated by its shape. The petal length and width of each example in figure 7.2
is indicated by the color/intensity of its shape, which denotes the position of the example with respect to the
projection of boundary $\mathcal{B}_{2,3}$ in figure 7.1.

The projection of boundary $\mathcal{B}_{2,3}$ onto sepal length/width space in figure 7.2 obviously depends on the
values of petal length and width. After some study of figures 7.1 and 7.2, it should be clear that a linear
classifier will produce the fewest errors if the projection of $\mathcal{B}_{2,3}$ onto sepal length/width space is $\mathcal{B}_{2,3A}$
for values of petal length and width corresponding to the blue region of figure 7.1. As we transition from
this region of feature space to the one corresponding to the red region of figure 7.1, the projection of $\mathcal{B}_{2,3}$
onto sepal length/width space in figure 7.2 transitions from $\mathcal{B}_{2,3A}$ to $\mathcal{B}_{2,3B}$. As a first approximation to the
full 3-dimensional projection of boundary $\mathcal{B}_{2,3}$, we can imagine that boundary $\mathcal{B}_{2,3A}$ in figure 7.2 applies
to all blue-colored examples (i.e., all those to the left of $\mathcal{B}_{2,3}$ in figure 7.1), and boundary $\mathcal{B}_{2,3B}$ applies
to all red-colored examples (i.e., all those to the right of $\mathcal{B}_{2,3}$ in figure 7.1). A linear classifier with such a
boundary will misclassify examples 83 and 133,² which lie very close to the projection of $\mathcal{B}_{2,3}$ in figure 7.1
(they are the circled examples in that figure).

The reader should note that boundary $\mathcal{B}_{2,3}$ and its projections are not unique, but represent a nominal
form of the minimum-error linear boundary between Iris classes $\omega_2$ and $\omega_3$. Thus, we would expect an
efficient learning algorithm to produce a linear classifier that correctly identifies all Iris examples except
numbers 83 and 133. We illustrate the implementational issues of differential learning by showing that it
produces just such a linear classifier.

² We use the indices 0 → 149 for the 150 examples in the database (see appendix L). Other authors use the indices 1 → 150.

[Figure 7.1 plot. Axis label: Petal Length (cm). Region label: Classify as $\omega_1$.]

Figure 7.1: Two of the four features (petal length and width) E. Anderson measured on the Irises of the
Gaspe Peninsula [3] (see appendix L). This figure is based on figure 6.11 of Duda & Hart [29]. With these
two features alone, Iris setosa are clearly distinguishable from the two other species (i.e., classes). Class
boundaries $\mathcal{B}_{1,2}$ and $\mathcal{B}_{2,3}$ separate the classes with relatively few errors. The blue/red color bar denotes the
position of the superimposed examples relative to boundary $\mathcal{B}_{2,3}$. The color blue denotes the $\omega_2$ side of
the boundary; red denotes the $\omega_3$ side of the boundary; the more intense the color, the farther the Euclidean
distance a superimposed example is from the boundary. The examples superimposed on the red/blue graphic
are the confusable examples of Iris versicolor ($\omega_2$) and Iris virginica ($\omega_3$) because they straddle the optimal
linear boundary between these two classes. These confusable examples are shown in figure 7.2. Examples 83
and 133 (circled) cannot be learned by a linear classifier.

[Figure 7.2 plot. Axis label: Sepal Length (cm).]

Figure 7.2: The confusable examples of figure 7.1, plotted as a function of the other two features (sepal
length and width). Each example’s shape denotes its true class. The color and intensity of the shape denote
the example’s position relative to boundary $\mathcal{B}_{2,3}$ in figure 7.1, in accordance with the blue/red color bar of
that figure. If the example is blue (i.e., falls to the left of boundary $\mathcal{B}_{2,3}$ in figure 7.1), boundary $\mathcal{B}_{2,3A}$
applies in this figure. If the example is red (i.e., falls to the right of boundary $\mathcal{B}_{2,3}$ in figure 7.1), boundary
$\mathcal{B}_{2,3B}$ applies in this figure. This linear model correctly classifies all but two of the 150 examples; it cannot
learn examples 83 and 133. These two examples are circled in figure 7.1; note that they are the two closest
examples to boundary $\mathcal{B}_{2,3}$ in that figure.

7.4 Controlling the Confidence Parameter

Section D.4 proves that the CFM confidence parameter $\psi$ must be proportional to $P_{W|X}(\omega_* \mid X)$, the largest
a posteriori class probability for X, in order for the classifier to learn the Bayes-optimal (i.e., most likely)
class $\omega_*$ (recall definition 2.1, page 17). Indeed, as $P_{W|X}(\omega_* \mid X)$ and/or the associated discriminant
differential³ $\delta_*(X \mid \theta^*)$ decrease in the vicinity of the class boundaries, $\psi$ must, by (2.102) and (2.104), be
decreased in order to learn $\omega_*$. As stated in sections 5.3.6 and 5.4, the relationship between small values
of $P_{W|X}(\omega_* \mid X)$ and/or $\delta_*(X \mid \theta^*)$ and small values of $\psi$ accounts for our use of the term “confidence”
parameter for $\psi$. If $\psi$ must be small to learn a training example, then we should literally have low
confidence that its associated class label is the Bayes-optimal class $\omega_*$.

In fact, the effect of $\psi$ on learning is not local, as one might infer from sections 2.4 and D.4; rather, it is
global (i.e., its value for one example affects the learning of all other examples). One’s a priori choice of
hypothesis class bounds the classifier’s functional complexity, and one can think of the differential learning
procedure simply as a means of allocating that complexity in such a way that CFM is maximized. Complexity
is allocated proportional to the confidence associated with each training example, so a fixed value of $\psi$ for
all  training  examples  determines  which  examples  can  be  learned  and  which,  if  any,  cannot  be  learned,  given 
the  hypothesis  class.  If  the  training  sample  size  is  large,  then  we  are  justified  in  learning  all  examples  — 
even  those  in  which  we  have  relatively  low  confidence  —  since  we  assume  that  the  sample  is  representative 
of  the  underlying  probabilistic  nature  of  X .  If  on  the  other  hand  the  training  sample  is  small,  we  are  unwise 
to  learn  examples  in  which  we  have  low  confidence,  since  they  may  not  be  representative  of  X .  Instead  we 
would  be  wise  to  learn  only  those  examples  in  which  we  have  relatively  high  confidence. 

The Iris data in figures 7.1 and 7.2 illustrate that training samples usually contain both “easy” examples
(i.e., ones that are easily classified) and “hard” examples (i.e., ones that aren’t so easily classified). Recall
our definitions of easy and hard examples in section 5.4. Probabilistically, the easy example is found far
from the Bayes-optimal class boundaries on $\mathcal{X}$, near a mode of its class-conditional pdf; its a posteriori
class probability $P_{W|X}(\omega_* \mid X)$ and the associated discriminant differential $\delta_*(X \mid \theta^*)$ are therefore large,
allowing it to be learned with high confidence $\psi$. The hard example is found in the vicinity of the class
boundaries on $\mathcal{X}$, in a “tail” of its class-conditional pdf. In these tails $P_{W|X}(\omega_* \mid X)$ and/or $\delta_*(X \mid \theta^*)$
are relatively small, so the hard example can be learned only with low confidence if it can be learned at all.⁴
The challenge, therefore, is to develop a learning procedure that learns easy examples with high confidence
and hard ones with low confidence, allocating the functional complexity of the classifier in commensurate
fashion.

³ Recall from section 2.4 that $\delta_*(X \mid \theta^*)$ denotes the discriminant differential $g_*(X \mid \theta^*) - \max_{k \neq *} g_k(X \mid \theta^*)$, where the
subscript $*$ denotes the index of the most likely class $\omega_*$. The notation also indicates that the discriminant differential is the one
generated by the discriminator with the CFM-maximizing parameterization $\theta^*$. Throughout the present discussion, we assume that
the discriminator possesses sufficient functional complexity to learn the Bayes-optimal classifier of X. As a result, we assume that
$\delta_*(X \mid \theta^*)$ is positive, as long as $\psi$ is sufficiently small.
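
The discriminant differential that separates easy from hard examples is simple to compute from the discriminator outputs. The sketch below takes the difference between the output for a designated class and the largest other output; the function name and the numerical output values are hypothetical and serve only to contrast an easy example with a hard one.

```python
from typing import Sequence


def discriminant_differential(outputs: Sequence[float], target: int) -> float:
    """Return g_target minus the largest of the other discriminant outputs."""
    largest_other = max(g for k, g in enumerate(outputs) if k != target)
    return outputs[target] - largest_other


# Hypothetical output states of a three-class discriminator.
easy = discriminant_differential([0.95, 0.03, 0.02], target=0)
hard = discriminant_differential([0.05, 0.48, 0.52], target=1)
```

A large positive value (about 0.92 for the first example) corresponds to an example near a mode; a small or negative value (about -0.04 for the second) corresponds to an example near a class boundary that has not been learned.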

Figure 7.3 illustrates the output state of a 15-parameter⁵ logistic linear classifier projected onto reduced
discriminator output space (definition 5.2, page 116). This is the reduced output state after the classifier
has attempted to learn all 150 examples of the Iris data with $\psi = 1$ (350 epochs). Reduced discriminator
output space is shown in the upper right of the figure, and the projection of the CFM objective function onto
the reduced discriminant continuum (i.e., the domain of $\delta = y_\tau - y_{\bar{\tau}}$) is shown in the lower left of the
figure. The discriminant differentials for most of the training examples are large, corresponding to output
states that are nearly binary. Two of the nine examples that are not learned have relatively large discriminant
differentials, indicating that the classifier is relatively confident in its incorrect classification. The remaining
seven misclassifications engender very small discriminant differentials (these examples appear in the lower
left and upper right corner of reduced discriminator output space in figure 7.3). Recall from section 2.4 that
the “linear” form of CFM (associated with the highest confidence value of unity) cannot learn examples for
which $P_{W|X}(\omega_* \mid X) < \frac{1}{2}$: the discriminant differential that maximizes CFM is zero in these cases. The
empirical a posteriori class probabilities of $\omega_2$ and $\omega_3$ are approximately $\frac{1}{2}$ in the vicinity of $\mathcal{B}_{2,3}$, which
accounts for the tiny discriminant differentials exhibited by seven of the nine misclassified examples. Most
of the remaining examples are classified with high confidence, as indicated by the large positive discriminant
differentials they engender.

Figure 7.4 shows the same classifier after learning with $\psi = 0.6$ (350 epochs). Three examples (70, 83,
and 133) are un-learnable at this level of confidence. Note that no examples generate binary output states:
the largest discriminant differential is approximately 0.7. Likewise, no learned or transition examples exhibit
discriminant differentials less than 0.3, in contrast with figure 7.3. The un-learned examples all exhibit
discriminant differentials in the vicinity of -0.4. By reducing the confidence with which the classifier learns,
we allow it to allocate its functional complexity in such a way that it learns more of the hard examples.

Figure 7.5 shows the effect of differential learning over 350 epochs when the confidence parameter is
gradually decreased from a starting value of 0.6 at epoch 0 to a final value of 0.1 beyond epoch 200. The
“gradual” reduction is linear (i.e., $\psi$ is reduced by a constant amount, approximately 0.0025 given these
endpoints, at the end of each epoch, beginning with epoch 0 and ending with epoch 200). We find that this
form of scheduled confidence reduction allows the classifier
to learn the easy examples with high confidence and the hard ones with (necessarily) lower confidence. After
350 epochs, only examples 83 and 133 remain un-learned, as our analysis in section 7.3 predicts for an
optimally-generated linear classifier. The largest discriminant differential exhibited by a learned example is
now approximately 0.68; the minimum discriminant differential is approximately 0.07 for the few learned
examples that fall near the reduced discriminant boundary (definition 5.5). Figure 7.6 compares histograms
of the classifier outputs corresponding to figures 7.4 and 7.5 respectively. For $\psi = 0.6$, outputs $y_1$ and $y_3$
are binary for most examples; output $y_2$ is approximately normally distributed about a mean of 0.45. When
$\psi$ is reduced to 0.1 in the scheduled manner described above, the distribution of $y_2$ becomes bimodal and
outputs $y_1$ and $y_3$ become noticeably less binary. These changes are sufficient to learn example 70 (cf.
figure 7.4 versus 7.5), one of the three un-learnable examples for $\psi = 0.6$ and, by its proximity to $\mathcal{B}_{2,3}$
and $\mathcal{B}_{2,3B}$ in figures 7.1 and 7.2, one of the hardest examples that a linear classifier can learn.

⁴ For stochastic feature vectors with overlapping class-conditional pdfs, the Bayes error rate is non-zero: some example/class label
pairs are inevitably un-learnable.

⁵ There are C = 3 discriminant functions, and the augmented feature vector has N + 1 = 5 elements. Therefore the classifier
has 3 · 5 = 15 parameters.
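
The scheduled confidence reduction described above can be sketched as a piecewise-linear function of the epoch index. The endpoints (0.6 at epoch 0, 0.1 from epoch 200 onward) and the 350-epoch run length come from the text; the function name, its signature, and the exact epoch at which each decrement takes effect are assumptions made for illustration.

```python
def confidence_schedule(epoch: int,
                        psi_start: float = 0.6,
                        psi_final: float = 0.1,
                        ramp_epochs: int = 200) -> float:
    """Confidence value used during the given epoch.

    The value falls linearly from psi_start to psi_final over ramp_epochs
    epochs and is held at psi_final for every later epoch.
    """
    if epoch >= ramp_epochs:
        return psi_final
    step = (psi_start - psi_final) / ramp_epochs  # about 0.0025 per epoch
    return psi_start - step * epoch


# Confidence value for each of the 350 learning epochs.
schedule = [confidence_schedule(epoch) for epoch in range(350)]
```

Under this sketch the last 150 epochs all run at the minimum confidence value of 0.1.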

Differential learning with high confidence focuses on the easy examples because they have a large
a posteriori probability $P_{W|X}(\omega_* \mid X)$ associated with the Bayes-optimal class $\omega_*$ and they generate a
large discriminant differential $\delta_*(X \mid \theta^*)$ (again, recall the relationships of (2.102) and (2.104)). In turn,
$P_{W|X}(\omega_* \mid X)$ and $\delta_*(X \mid \theta^*)$ are large where the associated class-conditional pdf is large,
that is, around its mode(s). As learning confidence is reduced, focus shifts from the modes of the training
sample’s empirical class-conditional distributions (i.e., the easy examples) to the tails (i.e., the hard examples
in the vicinity of the class boundaries), a phenomenon illustrated in figures 7.4 through 7.6. In this sense, one
can think of the “outlier” examples of a given class as ones containing the fine details (encoded by X) that
distinguish one class from another. By beginning with the easy examples and gradually proceeding to the
hard ones, differential learning with scheduled confidence reduction first learns the gross characteristics of
each class and then focuses on the details that distinguish each class from all the others.


7.5  Focussing  on  the  Un-Learned  Examples 

Figure 7.3: The 15-parameter differential logistic linear classifier’s output state, as projected onto the
reduced discriminant continuum, after attempting to learn all the Iris data with high confidence. Recall
from chapter 5 that $y_\tau$ denotes the classifier output corresponding to the correct class for each example, and
$y_{\bar{\tau}}$ denotes the largest other classifier output. The confidence parameter of 1.0 (set prior to learning) results
in a nearly linear form of the CFM objective function (lower left), which tends to engender binary output
states in the classifier (most examples appear in the lower right corner of the display). The classifier cannot
learn 9 of the 150 examples with this high level of confidence. Seven of these 9 un-learned examples occur
at ($y_\tau \approx 0$, $y_{\bar{\tau}} \approx 0$) or ($y_\tau \approx 1$, $y_{\bar{\tau}} \approx 1$).

Figure 7.4: The differentially-generated logistic linear classifier’s output state after attempting to learn the
Iris data with moderate confidence. The confidence parameter of 0.6 allows the classifier to learn all but
three of the 150 examples. Note that the output state of the classifier is no longer binary.

Figure 7.5: The differentially-generated logistic linear classifier’s output state after attempting to learn the
Iris data with low confidence. The confidence parameter of 0.1 allows the classifier to learn all but two of
the 150 examples. Across many independent trials, the two un-learnable examples are consistently 83 and
133: these are the examples shown to be un-learnable by a linear classifier in figure 7.2.

Figure 7.6: Left: Histograms of the output states for the classifier in figure 7.4 after 350 learning epochs;
$\psi = 0.6$. These histograms correspond to the reduced discriminator output state in figure 7.4. Right:
Histograms of the output states for the same classifier (figure 7.5) after 350 learning epochs; $\psi$ is reduced
from 0.6 to 0.1 over the first 200 learning epochs. These histograms correspond to the reduced discriminator
output state in figure 7.5.

Scheduled reduction of $\psi$ has added benefits:

• As $\psi \to 0^+$ and the synthetic CFM sigmoid becomes steeper, more training examples fall into
the “learned” category (i.e., they exhibit positive discriminant differentials that are large enough to
generate the maximum CFM value of unity). Since the derivative of the synthetic CFM objective
function is zero for learned examples, these examples have no effect on learning.

• Since the synthetic CFM derivative is zero for learned examples, the learning procedure can skip the
parameter adjustment phase associated with learned examples. For example, a differentially-generated
neural network classifier that uses backpropagation need not backpropagate on the learned examples.
As learning proceeds and most examples become learned, this results in substantial computational
savings.⁶

In short, differential learning with scheduled confidence reduction focuses on learning the un-learned
examples.
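
The computational shortcut described above amounts to filtering the training sample before each parameter-adjustment pass. The sketch below is a minimal illustration under stated assumptions: the threshold at which an example counts as "learned" depends on the functional form of the synthetic CFM objective and on the confidence parameter, neither of which is reproduced here, so `learned_threshold` is a hypothetical stand-in, as are the function names and the differential values.

```python
from typing import List, Sequence


def unlearned_indices(differentials: Sequence[float],
                      learned_threshold: float) -> List[int]:
    """Indices of examples whose differential is still below the threshold.

    Only these examples would contribute a non-zero derivative, so only
    they need a parameter-adjustment (e.g. backpropagation) pass.
    """
    return [i for i, d in enumerate(differentials) if d < learned_threshold]


def fraction_of_work(differentials: Sequence[float],
                     learned_threshold: float) -> float:
    """Fraction of the training sample that still needs adjustment."""
    return len(unlearned_indices(differentials, learned_threshold)) / len(differentials)


# Hypothetical discriminant differentials for six training examples.
differentials = [0.68, 0.55, 0.07, -0.40, 0.61, -0.35]
todo = unlearned_indices(differentials, learned_threshold=0.5)
work = fraction_of_work(differentials, learned_threshold=0.5)
```

In this toy case half of the examples are skipped, so the adjustment phase of the epoch costs roughly half as much as a full pass.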

Figure  7.7  illustrates  that  the  easy  learning  proceeds  relatively  quickly,  and  the  hard  learning  proceeds 
relatively  slowly.^  The  figure  shows  the  learning  curve  for  the  classifier  with  scheduled  confidence  reduction, 

^HafTneret  at  employ  an  analogous  form  of  focussed  probabilistic  learning  in  [48J.  They  use  the  mean-squaied-enor(MSE)  objective 
function  and  ignore  training  examples  that  generate  MSE  less  than  a  human-specined  threshold  value.  Since  there  is  no  monotonic 
relationship  between  an  example's  MSE  and  whether  or  not  it  is  correctly  classified,  the  method  does  not  necessarily  focus  on  un-leamed 
examples:  rather,  it  focuses  on  examples  with  relatively  high  MSE.  Neveitheless,  the  motivation  for  the  technique  is  the  computational 
savings  it  produces  —  a  motivation  that  we  both  acknowledge  and  share. 

’Please  see  section  8.2  for  a  description  of  the  experimental  protocols  we  employ  throughout  this  text  when  comparing  differentially- 
generated  classifiers  with  probabilistically-generated  controls. 


Figure 7.7: The empirical error rates (training sample with all 150 examples) for the 15-parameter logistic
linear classifier shown in figure 7.5 as it learns differentially (CFM). The classifier’s empirical error rate after
350 learning epochs is 1.3 (+2.5/-1.3)%.


shown  in  figure  7.5.  The  dark  gray  curve  shows  the  classifier’s  training  sample  empirical  error  rate  as  learning 
progresses;  95%  confidence  bounds  are  superimposed  on  the  curve  periodically.  The  light  gray  background 
shows  the  associated  value  of  CFM  as  learning  proceeds.  The  classifier  learns  to  distinguish  the  members  of 
ω₁ from members of the other two classes in fewer than five epochs. By 75 epochs the classifier has learned
all but nine of the hard examples; it then requires approximately 220 epochs to learn all but two of those nine
hard examples. Owing to the computational savings associated with learned examples, the actual number of
computations per epoch decreases in proportion to the fraction of learned examples in the training sample. As
a result, the last 150 learning epochs (associated with the minimum confidence value of 0.1) proceed more
rapidly than the early learning epochs.
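As a rough illustration of the confidence schedule used for figure 7.5, the following sketch reduces ψ from 0.6 to 0.1 over the first 200 epochs and holds it at the minimum thereafter. The linear ramp and the name `confidence_schedule` are assumptions made for illustration; the text specifies only the two endpoints and the length of the reduction.

```python
def confidence_schedule(epoch, psi_start=0.6, psi_end=0.1, ramp_epochs=200):
    """Return the CFM confidence value to use at a given learning epoch.

    Assumption: psi falls linearly from psi_start to psi_end over the first
    ramp_epochs epochs and is held at psi_end afterwards.
    """
    if epoch >= ramp_epochs:
        return psi_end
    fraction_done = epoch / ramp_epochs
    return psi_start + fraction_done * (psi_end - psi_start)


if __name__ == "__main__":
    # Print the confidence at a few representative epochs of a 350-epoch run.
    for epoch in (0, 100, 200, 350):
        print(epoch, round(confidence_schedule(epoch), 3))
```

Under this assumed ramp, the last 150 of the 350 epochs all run at the minimum confidence value.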

Figure  7.8  shows  the  learning  curve  for  the  logistic  linear  classifier  that  employs  probabilistic  learning 
via the MSE objective function. It learns the easy examples faster, and the hard ones more slowly than its

Figure 7.8: The empirical error rates (training sample with all 150 examples) for the 15-parameter logistic
linear classifier as it learns probabilistically (MSE). The classifier’s empirical error rate after 350 learning
epochs is 2.0 (+2.9/-2.0)%.


differentially-generated counterpart, a trend that we find common across a wide range of learning tasks.
Figure 7.9 shows the final state of this classifier after 350 learning epochs. All learning conditions are
identical to those for the differential model, except that the MSE objective function is minimized in lieu of
maximizing the CFM objective function. The MSE-generated classifier cannot learn examples 70, 83, and
133. Note in figure 7.9 that the easy examples (corresponding to the modes of the empirical class-conditional
example distributions) dominate the learning procedure: y_τ, the output representing the true class of a given
training example, is frequently unity, and y_τ̄, the largest other output, is generally less than 0.5. The harder
examples, corresponding to the tails of the empirical class-conditional example distributions, tend to cluster
along contours of constant MSE. Many of these examples exhibit zero values of y_τ̄ (i.e., they appear all along
the line y_τ̄ = 0 in figure 7.9). This output state shows that probabilistic learning engenders classifier outputs
that approximate the empirical a posteriori class probabilities of X to the degree of precision allowed by the




Figure  7.9:  The  15-parameter  logistic  linear  classifier’s  output  state  — as  projected  onto  the  reduced 
discriminant  continuum  —  after  attempting  to  learn  all  the  Iris  probabilistically  (MSE  —  see  figure  7.8). 


hypothesis  class. 

Figure  7.10  shows  the  learning  curve  for  the  logistic  linear  classifier  that  employs  probabilistic  learning 
via  the  Kullback-Leibler  information  distance  (CE  objective  function).  Like  the  MSE-generated  classifier,  it 
learns  the  easy  examples  faster,  and  the  hard  ones  more  slowly  than  its  differentially-generated  counterpart. 
Figure 7.11 shows the final state of this classifier after 350 learning epochs. All learning conditions are
identical to those for the differential model, except that the CE objective function is minimized in lieu of
maximizing the CFM objective function. Like the MSE-generated model, the CE-generated classifier cannot
learn examples 70, 83, and 133. Note again, the easy examples (corresponding to the modes of the empirical
class-conditional example distributions) dominate the learning procedure: y_τ, the output representing the
true class of a given training example, is frequently unity, and y_τ̄, the largest other output, is generally
less than 0.5. The harder examples, corresponding to the tails of the empirical class-conditional example
distributions, tend to cluster along contours of constant CE. Many of these examples exhibit zero values of
y_τ̄. Again, this kind of output state reflects the general nature of probabilistic learning.

Figure 7.10: The empirical error rates (training sample with all 150 examples) for the 15-parameter logistic
linear classifier shown in figure 7.3 as it learns probabilistically (Kullback-Leibler, CE). The classifier’s
empirical error rate after 350 learning epochs is 2.0 (+2.9/-2.0)%.


In  fairness  to  the  probabilistic  models,  they  are  not  significantly  worse  than  the  differential  model. 
Figure 7.1 clearly indicates that the empirical class-conditional example distributions are reasonably well
separated  and  unimodal  —  conditions  for  which  the  logistic  linear  classifier  is  a  reasonable  approximation 
to  a  proper  parametric  model  of  X  (see  definition  3.13  and  appendix  F).  As  a  result,  we  expect  — 
and  obtain  —  reasonably  good  discrimination  from  the  probabilistically-generated  classifiers.  As  we  shall 
see  in  section  7.7,  the  disparity  between  differential  and  probabilistic  learning  can  be  significant  when  the 
hypothesis class is not a proper parametric model of the data.

Figure 7.11: The 15-parameter logistic linear classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (CE; see figure 7.10).


7.6 Rejecting the Classification


Stochastic  concepts  are  sometimes  confusable  with  one  another  since  the  class-conditional  pdfs  of  their 
common  feature  vector  overlap.  Just  as  there  are  easy  and  hard  learning  examples,  there  are  easy  and  hard  test 
examples. When human subjects cannot identify a concept with high confidence they usually give a “don’t
know” classification. Synthetic pattern recognition systems often incorporate an analogous mechanism
that rejects test classifications that do not meet a minimum standard of “confidence”. Classical decision
theory formalizes the mechanism by which classification hypotheses are rejected. In general the mechanism
establishes a reject region (e.g., [40, pp. 78-82]) about the class boundaries on X inside which the classifier
always yields a “don’t know” classification.

Since differential learning is a discriminative form of learning, which focuses on estimating the class
boundaries on X, it is naturally compatible with the rejection mechanism described above. The reject
region on X maps to reduced discriminator output space in a straightforward manner. Figure 7.12 illustrates
this for the differential logistic linear classifier depicted in figure 7.4. The light gray shading on reduced

Figure 7.12: Figure 7.4 shown with a rejection threshold of δ_reject = 0.35 (see text) in light gray. For this
level of confidence (ψ = 0.6) and the rejection threshold shown, the classifier rejects 1.3% of the samples and
misclassifies 1.3%.

Figure 7.13: The differentially-generated logistic linear classifier’s output state after attempting to learn the
Iris data with lower confidence (0.26), shown with a rejection threshold of δ_reject = 0.15 (see text) in light
gray. For this level of confidence and the rejection threshold shown, the classifier rejects 3.3% of the samples
and misclassifies 0.7%.


Figure 7.14: Figures 7.9 (MSE, top) and 7.11 (CE, bottom) shown with a rejection threshold of δ_reject = 0.35
in light gray. Top: The MSE-generated classifier rejects 12.7% and misclassifies 0% of the training sample for
the δ_reject = 0.35 rejection threshold. The darker gray region corresponds to a (larger) reject region defined in
terms of MSE; its rejection/misclassification characteristics are worse. Bottom: The CE-generated classifier
rejects 14% and misclassifies 0.7% of the training sample for the δ_reject = 0.35 rejection threshold. The
darker gray region corresponds to a (larger) reject region defined in terms of CE; its rejection/misclassification
characteristics are worse.

discriminator output space (upper right) and the reduced discriminant continuum (lower left) denotes the
reject region. After learning the training sample, we can set the reject region in such a way that we obtain an
acceptable trade-off between error rate and rejection rate (i.e., the percentage of the total sample for which
the classification is rejected). The region is specified by a minimum discriminant differential δ_reject below
which the classification is rejected. Of course, given test cases for which we don’t know the true class label,
we always assume that the discriminant differential is positive; that is, we always assume the classification
is correct. Given the informed perspective from which we know the true class label of each example, the
discriminant differential might be negative. Thus, the reject region spans the interval [-δ_reject, δ_reject] in
reality. In the experimental chapters that follow, we use a simple default setting for δ_reject, based on the value
of ψ: δ_reject is one half the value of δ at the upper end of the synthetic CFM sigmoid’s linear transition leg
(i.e., δ of X_tp in figure D.1).⁸

Figure 7.12 illustrates a reject region corresponding to a δ_reject that is twice the default value. We double
the default width of the reject region because the classifier has learned all 150 Iris examples; we set a higher
than normal standard of confidence below which we reject the classification for the purpose of illustration.
Given this reject region, the classifier rejects 1.3% (2) of the training sample classifications, at the cost of
misclassifying 1.3% of the sample. If we decrease the confidence with which we learn to 0.26 (a result not
previously shown) and again set a δ_reject that is twice the default value, we reject 3.3% (5) of the training
sample at the cost of misclassifying 0.7% (1). This scenario is depicted in figure 7.13. Finally, if we apply a
δ_reject that is twice the default value to the results of figure 7.5, we reject none of the training examples at the
cost of misclassifying 1.3% (2).
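The rejection rule described above can be sketched in a few lines. The sketch assumes that the discriminant differential is the true-class output minus the largest other output, and that an example is rejected when this signed differential lies inside the closed interval from -δ_reject to +δ_reject. The function names and the four output vectors are made up for illustration and are not taken from the Iris experiments.

```python
def discriminant_differential(outputs, true_class):
    """Output of the true class minus the largest other output."""
    largest_other = max(y for i, y in enumerate(outputs) if i != true_class)
    return outputs[true_class] - largest_other


def rejection_statistics(output_vectors, true_classes, delta_reject):
    """Return (rejection rate, misclassification rate) as fractions.

    An example is rejected when its signed differential lies inside
    [-delta_reject, +delta_reject]. It counts as misclassified when the
    differential lies below -delta_reject (accepted, but wrong).
    """
    rejected = 0
    misclassified = 0
    for outputs, true_class in zip(output_vectors, true_classes):
        delta = discriminant_differential(outputs, true_class)
        if abs(delta) <= delta_reject:
            rejected += 1
        elif delta < 0:
            misclassified += 1
    total = len(true_classes)
    return rejected / total, misclassified / total


# Hypothetical three-class outputs, one row per example.
example_outputs = [
    [0.9, 0.1, 0.0],  # differential  0.8: accepted and correct
    [0.5, 0.4, 0.1],  # differential  0.1: rejected
    [0.2, 0.8, 0.0],  # differential -0.6: accepted but wrong
    [0.1, 0.3, 0.6],  # differential  0.3: rejected
]
example_labels = [0, 0, 0, 2]
print(rejection_statistics(example_outputs, example_labels, delta_reject=0.35))
```

Widening the threshold trades misclassifications for rejections, which is the trade-off the three scenarios above explore.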

The  tradeoff  between  error  rate  and  rejection  rate  in  these  three  scenarios  remains  both  balanced  and 
relatively  stable  across  a  wide  range  of  learning  confidence  parameters,  corresponding  to  a  wide  range  of 
reject  regions.  This  is  not  the  case  for  the  classifier  that  employs  probabilistic  learning.  Since  the  classifier  that 
learns with ψ = 0.6 in figures 7.4 and 7.12 exhibits the same empirical error rate as the
probabilistically-generated variants in figures 7.9 and 7.11, we apply the reject region shown in figure 7.12 to figures 7.9 and
7.11 as a means of fairly comparing differential and probabilistic learning. Given this reject region, depicted
by the light shading in figure 7.14, the MSE-generated classifier rejects 12.7% (19) of the sample at the cost
of making no misclassifications; the Kullback-Leibler (CE) variant rejects 14% of the sample at the cost of
misclassifying 0.7% (1). Thus, probabilistically-generated classifiers reject a significantly higher proportion
of  examples  without  attaining  a  significantly  lower  error  rate.  If  the  reject  region  is  defined  in  terms  of 
the  MSE  or  CE  that  an  example  (always  assumed  to  be  correctly  classified)  elicits,  the  reject  regions  are 
depicted  by  the  light  and  dark  shading  in  figure  7.14.  The  resulting  rejection/misclassification  statistics  for 

⁸We do not proffer a theoretically justified approach to setting δ_reject; although we believe that this is an important avenue of
research, we limit ourselves to this acknowledgement in the interest of bounding the present text’s scope.


Figure 7.15: The empirical error rates (training sample with all 150 examples) for the 15-parameter linear
classifier as it learns differentially (CFM). The classifier’s empirical error rate after 350 learning epochs is
1.3 (+2.5/-1.3)%.


these MSE/CE-based reject regions are worse than those for the δ_reject-based regions: further evidence
that minimizing an error measure is not monotonically related to minimizing the classifier’s error rate.

7.7  The  Importance  of  Representational  Choices 


In section 7.5 we found that differential learning did not produce a logistic linear classifier with a significantly
lower empirical error rate than its probabilistically-generated counterparts. We attributed this to the logistic
linear hypothesis class, which is a good approximation to the proper parametric model of the Iris feature
vector. In this section we explore the effects of changing the hypothesis class from the logistic functional
basis to alternative functional bases: linear and Gaussian radial basis.

Figure 7.15 shows the learning curve for a simple linear classifier (i.e., one with discriminant functions

Figure 7.16: The 15-parameter differentially-generated linear classifier’s output state, as projected onto the
reduced discriminant continuum, after attempting to learn all the Iris with low confidence. The classifier
cannot learn examples 83 and 133 (see figure 7.2).

Figure 7.17: The empirical error rates (training sample with all 150 examples) for the 15-parameter linear
classifier as it learns probabilistically (MSE). The classifier’s empirical error rate after 350 learning epochs
is 14.7 (+6.2/-5.8)%.


that  are  simple  linear  functions  of  the  feature  vector).  The  classifier  employs  differential  learning  with 
scheduled  confidence  reduction  from  0.6  at  epoch  zero  to  0.04  beyond  epoch  200.  The  only  appreciable 
learning  difference  between  this  linear  classifier  and  its  logistic  linear  counterpart  shown  in  figure  7.7  is  in 
their  convergence  rates.  The  simple  linear  model  learns  the  easy  examples  and  most  of  the  hard  ones  faster 
than  the  logistic  model.  We  attribute  this  phenomenon  solely  to  the  change  in  the  classifier’s  functional 
basis. The linear model learns at a rate that is approximately linearly proportional to the training sample size
(section 5.5.1). The logistic model learns at a rate that is exponentially proportional to the training sample size
as its parameters grow large because large parameter values drive the logistic non-linearities towards their
limiting step functional form. The proof of this assertion follows directly from section D.3.1, which proves
that gradient-based learning via the original logistic sigmoidal form of CFM becomes unreasonably slow
as the CFM sigmoid approaches its limiting step functional form. Faster convergence notwithstanding, the

Figure 7.18: The 15-parameter linear classifier’s output state, as projected onto the reduced discriminant
continuum, after attempting to learn all the Iris probabilistically (MSE). The classifier cannot learn 22 of
the examples.


linear and logistic linear classifiers exhibit the same final empirical error rate of 1.3 (+2.5/-1.3)%. Figure 7.16
shows the reduced discriminator output state of the linear classifier after 350 learning epochs. Note that
because the classifier is linear in X, its outputs are unbounded real values rather than values confined to
the unit interval. This explains why a
number of the training examples appear outside the unit square in the figure. Despite the substantial change
in functional basis, this linear classifier exhibits the same learning characteristics as its logistic counterpart:
it fails to learn examples 83 and 133.
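The slowdown attributed to the logistic functional basis can be illustrated numerically. The sketch below evaluates the slope of the standard logistic function, σ(a)(1 - σ(a)), at increasing activations; large parameter values produce large activations, and the slope that scales every gradient step shrinks roughly exponentially. This is only an illustration of the saturation effect under that standard formula, not the proof given in section D.3.1, and the function names are chosen here for convenience.

```python
import math


def logistic(a):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-a))


def logistic_slope(a):
    """Derivative of the logistic function with respect to its activation."""
    s = logistic(a)
    return s * (1.0 - s)


# The slope peaks at a = 0 and decays quickly as the activation grows,
# which is what makes gradient-based updates small for saturated outputs.
for activation in (0.0, 2.0, 5.0, 10.0):
    print(activation, logistic_slope(activation))
```

A purely linear discriminant function has a constant slope, so it does not suffer from this saturation.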

The  same  is  not  true  for  the  simple  linear  classifier  that  employs  probabilistic  learning.  Because  the 
linear  discriminant  functions  are  a  decidedly  improper  parametric  model  of  the  Iris  feature  vector,  they  have 
insufficient  functional  complexity  to  approximate  the  a  posteriori  class  probabilities  of  X.  Figure  7.17 
shows  the  probabilistic  learning  curve  for  the  MSE  objective  function.  The  classifier  cannot  learn  14.7 
(+6.2/-5.8)% (22) of the 150 examples, which are clearly visible in figure 7.18. Note that most of these
un-learnable training examples fall inside the 0.36 MSE contour. All of these examples are learned with the
differential strategy.
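The empirical error rates quoted in this chapter are simple fractions of the 150-example training sample: 2, 3, 7, 10, and 22 errors correspond to 1.3%, 2.0%, 4.7%, 6.7%, and 14.7%. The sketch below checks that arithmetic and adds a normal-approximation 95% interval for comparison. The interval method is an assumption made here for illustration; it treats each example as a Bernoulli trial but is not claimed to be the method behind the bounds printed in the text, and it does not reproduce them exactly.

```python
import math


def empirical_error_rate(errors, sample_size):
    """Error rate as a percentage of the sample."""
    return 100.0 * errors / sample_size


def wald_interval_95(errors, sample_size):
    """Normal-approximation 95% interval, in percent, clipped to [0, 100].

    Assumption: a textbook Wald interval, used only for illustration.
    """
    p = errors / sample_size
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / sample_size)
    lower = 100.0 * max(0.0, p - half_width)
    upper = 100.0 * min(1.0, p + half_width)
    return lower, upper


for errors in (2, 3, 7, 10, 22):
    rate = empirical_error_rate(errors, 150)
    low, high = wald_interval_95(errors, 150)
    print(errors, round(rate, 1), round(low, 1), round(high, 1))
```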




Figure 7.19: The empirical error rates (training sample with all 150 examples) for the 15-parameter modified
RBF classifier as it learns differentially (CFM). The classifier’s empirical error rate after 350 learning epochs
is 2.0 (+2.9/-2.0)%.


Figure  7.19  shows  the  learning  curve  for  a  modified  radial  basis  function  (RBF)  classifier  (see  appendix  K) 
that  employs  differential  learning.  The  classifier  has  no  hidden  layer  nodes,  only  three  output  nodes 
corresponding  to  the  three  discriminant  functions.  For  both  differentially  and  probabilistically-generated 
variants, the mean vectors of the model {μ₁, μ₂, μ₃} are initialized to the empirical class-conditional means
of the training sample, and the single variance parameters {σ₁, σ₂, σ₃} of (K.3) are set to the average
eigenvalue of their corresponding empirical class-conditional covariance matrix. This initialization procedure
is not necessary; it simply reduces the learning time for all models.⁹ The differentially-generated model

⁹The critical reader will note that the initialization procedure is fundamentally probabilistic. In this sense the reader might think
it logically inconsistent for us to follow such an initialization with differential learning, claiming some advantage over probabilistic
learning. Realizing this potential inconsistency, we ran a series of simulations without such initialization. The results for this and other
tasks were fundamentally identical to the results with initialization, the only difference being that the initialized models learned much
more quickly. In our view, one of the principal weaknesses of radial basis functions is their very local nature, which leads to learning
times that increase exponentially as the RBF covariance matrix eigenvalues decrease (i.e., as the RBF nodes become increasingly local).




Figure 7.20: The 15-parameter differential modified RBF classifier’s output state, as projected onto the
reduced discriminant continuum, after attempting to learn all the Iris with low confidence. The classifier
cannot learn examples 70, 83, and 133 (see figure 7.2).

Figure 7.21: The empirical error rates (training sample with all 150 examples) for the 15-parameter modified
RBF classifier as it learns probabilistically (MSE). The classifier’s empirical error rate after 350 learning
epochs is 4.7 (+4.0/-3.4)%.

learns all the examples except 70, 83, and 133 after 350 epochs (confidence is reduced from 0.6 at epoch zero
to 0.14 beyond epoch 200). The classifier’s final reduced discriminator output state is shown in figure 7.20.
Again, this differentially-generated classifier is not substantially worse than its logistic linear counterpart,
despite the substantial change in the discriminator’s functional basis. These results indicate that differential
learning  is  relatively  insensitive  to  the  representational  scheme  of  the  hypothesis  class  (i.e.,  its  functional 
basis).  By  the  proofs  of  chapter  3,  differential  learning  will  produce  the  lowest  MSDE  possible,  given  the 
representational  scheme.  These  experiments  bear  that  out. 
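The initialization used for the RBF variants in this section (class-conditional means, and one variance per class equal to the average eigenvalue of the class covariance matrix) can be sketched as follows. Because the eigenvalues of a covariance matrix sum to its trace, the average eigenvalue equals the mean of the per-feature variances, so no eigendecomposition is needed. The function name, the unbiased (n - 1) normalisation, and the toy data are assumptions made for illustration; the modified RBF model itself is defined in appendix K and is not reproduced here.

```python
def class_conditional_initialization(examples, labels):
    """Return {class: (mean_vector, variance)} for an RBF-style model.

    The variance is the average eigenvalue of the class covariance matrix,
    computed as trace / dimension. Assumption: the covariance uses the
    unbiased (n - 1) normalisation, so each class needs at least two examples.
    """
    grouped = {}
    for x, label in zip(examples, labels):
        grouped.setdefault(label, []).append(x)

    initial = {}
    for label, members in grouped.items():
        n = len(members)
        dim = len(members[0])
        mean = [sum(x[d] for x in members) / n for d in range(dim)]
        # Diagonal covariance entries are the per-feature variances.
        variances = [
            sum((x[d] - mean[d]) ** 2 for x in members) / (n - 1)
            for d in range(dim)
        ]
        initial[label] = (mean, sum(variances) / dim)
    return initial


# Hypothetical two-class, two-feature sample.
toy_examples = [[0.0, 0.0], [2.0, 4.0], [10.0, 10.0], [12.0, 10.0]]
toy_labels = [0, 0, 1, 1]
print(class_conditional_initialization(toy_examples, toy_labels))
```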

Figures 7.21 through 7.24 illustrate the learning curves and final reduced discriminator output state for the MSE
and Kullback-Leibler probabilistic RBF variants. The MSE-generated classifier exhibits a 4.7 (+4.0/-3.4)%


By placing the radial basis functions in quasi-optimal locations on feature space at the outset, this probabilistic initialization procedure
reduces learning times substantially.

Figure 7.22: The 15-parameter modified RBF classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (MSE). The classifier cannot
learn 7 of the examples.


empirical error rate after 350 learning epochs, whereas the Kullback-Leibler-generated classifier exhibits
a 6.7 (+4.6/-4.0)% rate after the same number of epochs. Again we see the non-monotonic nature of
probabilistic learning when the hypothesis class is not a proper parametric model of the feature vector.
Most of the misclassified training examples in figure 7.22 fall inside the 0.36 MSE contour, while most
of those in figure 7.24 fall inside the 0.824 CE contour. It is ironic that there are regions on the correct
side of reduced discriminant space in which MSE/CE is the same or higher. Clearly, probabilistic learning
minimizes the functional error between the discriminator and the training sample without regard to whether
an example is learned or un-learned. As a result, the easy examples (i.e., the majority of the training sample)
dominate the error minimization, and the hard examples are never learned. When the hypothesis class is a
distinctly improper parametric model, as it is in figures 7.19, 7.21, and 7.24, the phenomenon is particularly
pronounced; decreasing the functional error paradoxically increases the discriminant error.
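A two-example sketch makes the lack of a monotonic link between functional error and correctness concrete. Both output vectors below are hypothetical, and the error is computed as an un-normalised sum of squares, which is an assumption about the exact form of the MSE objective; any constant normalisation leaves the comparison unchanged.

```python
def squared_error(outputs, target):
    """Sum of squared differences between outputs and the target vector."""
    return sum((y - t) ** 2 for y, t in zip(outputs, target))


def is_correct(outputs, true_class):
    """True when the largest output belongs to the true class."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return winner == true_class


target = [1.0, 0.0, 0.0]                 # the true class is index 0
correct_but_costly = [0.34, 0.33, 0.33]  # narrowly correct, large error
wrong_but_cheap = [0.49, 0.50, 0.00]     # narrowly wrong, smaller error

for outputs in (correct_but_costly, wrong_but_cheap):
    print(is_correct(outputs, 0), round(squared_error(outputs, target), 4))
```

The correctly classified example incurs the larger squared error, so an objective that only lowers functional error has no reason to prefer it over the misclassified one.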


Figure 7.23: The empirical error rates (training sample with all 150 examples) for the 15-parameter modified
RBF classifier as it learns probabilistically (Kullback-Leibler, CE). The classifier’s empirical error rate
after 350 learning epochs is 6.7 (+4.6/-4.0)%.


7.8  Minimizing  the  Classifier’s  Complexity 

One  way  to  learn  all  the  Iris  examples  probabilistically  is  to  increase  the  functional  complexity  of  the 
hypothesis  class  enough  so  that  minimizing  the  resulting  classifier's  functional  error  is  sure  to  minimize  the 
empirical  error  rate  for  the  training  sample.  If  we  do  this  we  are  likely  to  cut  ourselves  on  Occam’s  razor. 
Specifically,  by  increasing  the  classifier’s  complexity  we  reduce  its  discriminant  bias  at  the  cost  of  increasing 
its  discriminant  variance:  the  net  effect  for  small  training  sample  sizes  is  an  increase  in  the  classifier’s 
mean-squared  discriminant  error  (MSDE)  —  a  phenomenon  we  shall  see  repeatedly  in  the  chapters  that 
follow.  The  functional  bias-variance  tradeoff  is  well-known  both  in  detection  and  estimation  theory,  and  the 
connectionist  literature  (e.g.,  [41]).  We  remind  the  reader  that  we  are  discussing  a  very  different  tradeoff 
between  discriminant  bias  and  variance  (see  chapter  3).


Figure 7.24: The 15-parameter modified RBF classifier’s output state, as projected onto the reduced
discriminant continuum, after attempting to learn all the Iris probabilistically (Kullback-Leibler, CE).
The classifier cannot learn 10 of the examples.


Our  Iris  experiments  illustrate  the  importance  of  corollary  3.1: 

• because differential learning is asymptotically efficient, it guarantees the lowest discriminant bias
possible for a particular choice of hypothesis class.

• because differential learning requires the hypothesis class with the least functional complexity necessary
to achieve a specific level of discriminant bias for asymptotically large training samples, it guarantees
the least discriminant variance and, as a result, the lowest MSDE possible for small training sample
sizes.¹⁰

We shall see in the following chapters that the minimum complexity requirement of differential learning
is the key to its producing classifiers with consistently lower empirical error rates than those produced by
probabilistic learning. We find that many interesting pattern recognition tasks can be learned with simple
classifiers, which generalize well to unseen test examples by virtue of their simplicity.

¹⁰The one exception to this guarantee is when the hypothesis class is a minimum-complexity proper parametric model of X, as
described in section 3.6 and appendix F.


The 15-parameter logistic linear classifier employing differential learning exhibits a 2.7 (+2.2/-2.6)%
error rate (4 errors) when it learns and is tested in 150 leave-one-out [84] trials.¹¹ This result suggests that the
logistic linear classifier’s MSDE is relatively low and that the model generalizes well. The best independent
leave-one-out test result for any classifier is a statistically equivalent 2.0 (+2.9/-2.0)% error rate (3 errors),
reported in [139, ch. 6]. The 15-parameter logistic linear classifier employing probabilistic learning via MSE
exhibits a 6.0 (+4.4/-3.9)% error rate (9 errors) when it learns and is tested in 150 leave-one-out trials. The
15-parameter logistic linear classifier employing probabilistic learning via CE exhibits an 8.0 (+5.0/-4.4)%
error rate (12 errors) when it learns and is tested in 150 leave-one-out trials.
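The leave-one-out protocol behind these figures can be sketched generically: each of the N examples is held out once, the classifier learns the remaining N - 1, and the held-out example is then tested. The nearest-class-mean learner below is a hypothetical stand-in chosen to keep the sketch self-contained; it is not the logistic linear classifier or any learning strategy from the text, and the toy data are made up.

```python
def leave_one_out_errors(examples, labels, train, classify):
    """Count errors over len(examples) leave-one-out trials."""
    errors = 0
    for held_out in range(len(examples)):
        train_x = examples[:held_out] + examples[held_out + 1:]
        train_y = labels[:held_out] + labels[held_out + 1:]
        model = train(train_x, train_y)
        if classify(model, examples[held_out]) != labels[held_out]:
            errors += 1
    return errors


# Hypothetical stand-in learner: a nearest-class-mean classifier.
def train_nearest_mean(train_x, train_y):
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def classify_nearest_mean(model, x):
    def distance(mean):
        return sum((a - b) ** 2 for a, b in zip(x, mean))
    return min(model, key=lambda y: distance(model[y]))


toy_x = [[0.0], [1.0], [2.0], [10.0], [11.0], [12.0]]
toy_y = [0, 0, 0, 1, 1, 1]
n_errors = leave_one_out_errors(toy_x, toy_y, train_nearest_mean, classify_nearest_mean)
print(n_errors, round(100.0 * n_errors / len(toy_x), 1))
```

With 150 trials, 4, 9, and 12 errors give the 2.7%, 6.0%, and 8.0% rates quoted above.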

7.9  Summary 

The  Iris  classification  task  is  an  interesting  case  study  because  the  task  is  real,  not  fabricated,  the  data  have 
been  studied  extensively  by  a  number  of  authors,  and  the  classes  are  not  quite  linearly  separable.  As  a 
result,  the  task  is  neither  trivial  nor  hard,  and  it  provides  a  good  comparison  of  differential  and  probabilistic 
learning.  By  visualizing  the  Iris  data  in  two  dimensions,  we  find  that  a  linear  classifier  should  be  able  to  learn 
all but two of the 150 examples. The two linear classifiers (ones with linear and logistic functional bases)
employing  differential  learning  do  indeed  learn  all  but  two  of  the  examples.  A  (non-linear)  modified  RBF 
classifier  learns  all  but  three  of  the  examples.  Comparable  probabilistically-generated  classifiers  cannot  learn 
as  many  of  the  examples,  illustrating  both  the  inefficiency  of  probabilistic  learning  and  its  sensitivity  to  the 
representational  scheme. 

Because  the  Iris  data  constitute  a  3-class  pattern  recognition  task  with  a  4-element  feature  vector,  and 
because  the  classes  are  nearly  separable,  the  learning  task  is  relatively  easy.  In  the  chapters  that  follow,  we 
explore  learning  tasks  that  are  somewhat  harder.  Throughout  these  experiments,  we  find  the  results  of  this 
chapter  repeated:  differential  learning  is  efficient,  producing  the  classifier  that  generalizes  best  for  a  given 
choice  of  hypothesis  class. 


¹¹The 95% confidence bounds we give are based on the assumption that each leave-one-out trial is a Bernoulli trial [62].


Chapter  8 

Optical  Character  Recognition  with 
Differential  Learning 


Outline 

We use a linear classifier employing differential learning to recognize handwritten digits of the AT&T “little”
(DB1) database.¹ The classifier has 650 total parameters for this optical character recognition (OCR) task.
After learning the benchmark training sample, the classifier exhibits a 1.3% error rate on the benchmark
test sample. Its probabilistically-generated counterparts exhibit twice this error rate, as does the best
independently-developed linear classifier. A differentially-generated simple Gaussian radial basis function
(RBF) classifier achieves a 2.0% error rate on the benchmark test sample, not substantially worse than
the linear model, despite the substantial “representational” change (i.e., the change in functional basis). An
identical probabilistically-generated RBF exhibits a 10.3% error rate on the benchmark test sample. We use
noisy versions of the DB1 database to illustrate the special (and readily discernible) conditions under which
differential learning might not produce the best-generalizing classifier for small training sample sizes.

8.1  Introduction 

The AT&T DB1 database contains 1200 handwritten digits: ten examples of each digit, obtained from each
of twelve different subjects [47]. Figure 8.1 illustrates 40 examples from the database. Each example is
a 256-pixel (16 x 16) binary image (i.e., pixels are either black = -1 or white = +1). The examples are
well-defined to the human eye and have uniform scale and orientation. Since its introduction, the database
has become a benchmark standard for evaluating learning procedures and neural network architectures in
the optical character recognition (OCR) domain. We in turn use the database to illustrate the theoretical
arguments of part I.

^1 We thank Dr. Isabelle Guyon of AT&T for providing us with the DB1 database. Readers interested in previous research on this
database should review [46, 41, 47, 16].




Figure 8.1: Forty digits randomly chosen from the AT&T DB1 database.


We show that compressing the 256-pixel (16 x 16) binary images to 64-pixel (8 x 8) 5-state images
allows us to employ less complex classifiers, which exhibit lower test sample error rates than those designed
for the un-compressed images. Specifically, simple linear and non-linear classifiers with 650 parameters^2
(65/digit), one fourth the number of parameters necessary for the 256-pixel images, classify the
compressed images with test sample error rates on the order of 2%. We compare these classifiers, which learn
differentially, with counterparts that learn probabilistically. The latter exhibit error rates that are between 1.7
and 3.5 times the differentially-generated models' rates, depending on the classifier's functional basis. We
conclude  by  extending  the  experiments  of  [41]  in  which  the  original  256-pixel  binary  images  are  corrupted 
by  noise  that  takes  the  form  of  randomly  inverted  pixel  states.  We  derive  simple  signal-to-noise  ratio  (SNR) 
expressions  for  the  noisy  images.  We  then  use  compressed  versions  of  the  noisy  images  to  illustrate  the 
special  (and  readily  discernible)  circumstances  under  which  differential  learning  might  not  generate  the 
relatively  efficient  classifier  (definition  3.15  —  i.e.,  the  one  with  the  lowest  MSDE  allowed  by  the  choice  of 
hypothesis  class)  for  small  training  sample  sizes. 
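
The compression procedure itself is described in section M.3 and is not reproduced here. As a minimal sketch only, assuming the compression amounts to averaging each non-overlapping 2 x 2 block of pixels (an assumption, not the documented procedure), the following shows how a 16 x 16 binary image of -1/+1 pixels becomes an 8 x 8 image whose pixels take one of five states. The function name `compress_2x2` and the example image are hypothetical.

```python
def compress_2x2(image):
    """Average each non-overlapping 2x2 block of a square image.

    Assumption: this block averaging stands in for the procedure of
    section M.3, which is not reproduced in this chapter.
    """
    size = len(image)
    assert size % 2 == 0 and all(len(row) == size for row in image)
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) / 4
            for c in range(0, size, 2)
        ]
        for r in range(0, size, 2)
    ]


# Hypothetical 16x16 image: black (-1) left half, white (+1) right half,
# with the top-left pixel flipped to white.
image = [[-1] * 8 + [+1] * 8 for _ in range(16)]
image[0][0] = +1

small = compress_2x2(image)
states = sorted({value for row in small for value in row})

# Each discriminant function then sees 64 pixels plus one bias element.
elements_per_discriminant = len(small) * len(small[0]) + 1
total_parameters = 10 * elements_per_discriminant
```

Averaging four pixels that are each -1 or +1 can only yield -1, -0.5, 0, +0.5, or +1, which is consistent with the "5-state" description above.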

8.1.1 A Word Regarding Training and Test Samples

Throughout this chapter we refer to a "benchmark split" of the DB1 database. This term refers to the
partitioning of the database into a training sample and test sample. Both samples contain 600 examples. The
benchmark training sample comprises the first five examples of each digit, obtained from each of the twelve
subjects. The benchmark test sample comprises the last five examples of each digit, obtained from each of
the twelve subjects. This benchmark split has been used in a number of previous papers on the database; we
use it in order to compare our results with previously published ones. We also run multiple trials in which
the training examples are selected randomly from the entire 1200-example database with probability 1/2 (see
sections 8.2.1 and 8.2.2). Since these random splits of the database do not guarantee a balanced number of
examples of each digit for each subject, the empirical test sample error rates of classifiers generated/tested
with them are typically higher than the rate for the benchmark split.

^2 There are C = 10 discriminant functions, and the augmented feature vector has N + 1 = 65 elements. Therefore the classifier
has 10 · 65 = 650 total parameters.

8.2  Test  and  Evaluation  Protocols 


Classifier comparisons are strictly controlled: Throughout this entire text, when we compare classifiers
that employ differential learning with those that employ probabilistic learning, all experimental
conditions in a given trial are identical except for the objective function used to drive the learning
procedure. Learning rates, momentum terms,^3 weight decay and/or weight smoothing constants (see
appendix M), training and test samples, the hypothesis class and its associated parameter space, and so on
are all identical: only the objective function is different. Furthermore, learning is
completely automated after task/classifier setup, so there is no human intervention during the actual learning
process. These controls aim at an unbiased comparison of differential and probabilistic learning strategies.


8.2.1  Estimating  Error  Rates 

All  estimated  error  rates  quoted  in  this  text  are  based  on  classification  results  for  test  samples  that  have  no 
examples  in  common  with  the  training  sample  used  for  learning. 

Definition 8.1 Estimated error rate: Given a single test sample of size $\eta$, the estimated error rate,
which we denote by $\hat{P}_e(\mathcal{G} \mid \theta, \eta)$, for the classifier with discriminator
$\mathcal{G}(X \mid \theta)$ is simply the ratio of test sample errors $E(\eta)$ to the total number of
test examples $\eta$:

$$\hat{P}_e(\mathcal{G} \mid \theta, \eta) = \frac{E(\eta)}{\eta} = \frac{\text{number of test sample errors}}{\text{test sample size}} \tag{8.1}$$


Remark: We sometimes refer to $\hat{P}_e(\mathcal{G} \mid \theta, \eta)$ as the classifier's empirical test sample error rate. It is
valid to view $\hat{P}_e(\mathcal{G} \mid \theta, \eta)$ as an asymptotically unbiased, maximum-likelihood estimator of the classifier's
true error rate $P_e(\mathcal{G} \mid \theta)$. Indeed, $\hat{P}_e$ itself is a binomially-distributed random variable with
mean $p = P_e(\mathcal{G} \mid \theta)$. We assume that $p \approx \hat{P}_e(\mathcal{G} \mid \theta, \eta)$ in order to compute 95% confidence bounds on
$\hat{P}_e(\mathcal{G} \mid \theta, \eta)$ in the manner described by Highleyman [62].

^3 Learning is a search over parameter space for the parameterization that maximizes the objective function, given the differentiable
supervised classifier and the training sample. We employ a variant of the simple gradient-based search algorithm with "momentum"
typically associated with the backpropagation algorithm (e.g., [119, 120]).

We judge the error rates of two different classifier/learning strategy combinations, estimated from a
single learning/test trial (involving a single training/test sample), to be significantly different if their 95%
confidence intervals do not overlap. When the classifier/learning strategies are compared over a series of
independent trials (each involving an independent, randomly-selected training/test sample), we consider them
to be significantly different if one classifier's empirical test sample error rate is consistently lower than the
other classifier's.
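
The single-trial comparison rule can be sketched in a few lines. The sketch below is illustrative only: it does not reproduce Highleyman's procedure [62], and instead substitutes the Wilson score interval as a stand-in binomial confidence interval (an assumption). The function names and the error counts (8 and 16 errors on a 600-example test sample) are hypothetical, chosen to give rates near 1.3% and 2.7%.

```python
import math


def estimated_error_rate(errors, test_size):
    """Ratio of test sample errors to test sample size, as in (8.1)."""
    return errors / test_size


def wilson_interval(errors, test_size, z=1.96):
    """Approximate 95% binomial confidence interval (Wilson score).

    Assumption: this is a substitute for the bounds of [62], not the
    method used in the text.
    """
    p = errors / test_size
    denom = 1 + z * z / test_size
    center = (p + z * z / (2 * test_size)) / denom
    half = z * math.sqrt(
        p * (1 - p) / test_size + z * z / (4 * test_size ** 2)
    ) / denom
    return center - half, center + half


def intervals_overlap(a, b):
    """True when two closed intervals (low, high) share any point."""
    return a[0] <= b[1] and b[0] <= a[1]


rate_a = estimated_error_rate(8, 600)
interval_a = wilson_interval(8, 600)
interval_b = wilson_interval(16, 600)

# Under the rule above, overlapping intervals mean the two error rates
# are not judged significantly different on this single trial.
significantly_different = not intervals_overlap(interval_a, interval_b)
```

Because the interval construction differs from that of [62], the bounds produced here will not match the confidence intervals quoted in this chapter exactly.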

It is important to clarify the nature of our "independent, randomly-selected training/test samples." In this
chapter we have 1200 total digit examples. Each randomly-selected training sample contains approximately
600 examples; the associated test sample contains all the examples in the original set of 1200 that are not in
the training sample. Different training/test samples are independent to the extent that they contain different
randomly-selected sub-sets of the original 1200-example database; the selection procedures are independent
across trials. We denote the size of the kth training sample by $n_k$ and the size of the associated kth test
sample by $\eta_k$. Thus, our 25 independent learning/test trials in this chapter (and similar trials in chapter 9)
constitute 25 repetitions of a 2-fold cross validation procedure (e.g., [139, 91]) by which we estimate the
classifier's true error rate $P_e(\mathcal{G} \mid \theta)$.

Cross validation: In general, 2-fold cross validation is done by dividing all the labeled examples of the feature
vector into a training sample and a test sample of approximately equal size (i.e., $n_k \approx \eta_k$: a 50/50 partitioning,
or "split", of the entire data sample). We use this protocol throughout this text unless otherwise stated.
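
A minimal sketch of the random splitting described above, assuming each example is assigned to the training sample independently with probability 1/2 and every unselected example goes to the test sample. The function name `random_split` and the use of the trial index as a random seed are illustrative assumptions.

```python
import random


def random_split(num_examples, p_train=0.5, seed=None):
    """Assign each example index to training with probability p_train.

    Examples not chosen for training form the test sample, so the two
    samples are disjoint and together cover the whole database.
    """
    rng = random.Random(seed)
    train, test = [], []
    for index in range(num_examples):
        (train if rng.random() < p_train else test).append(index)
    return train, test


# 25 independent trials over a 1200-example database; seeding by the
# trial number is only a convenience for reproducibility.
splits = [random_split(1200, seed=trial) for trial in range(25)]
train_sizes = [len(train) for train, _ in splits]
```

Each training sample contains roughly, not exactly, 600 examples, and nothing in this procedure balances the number of examples per digit or per subject.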


Repeated 2-fold cross validation generates an error rate estimator (and, as a result, a discriminant bias
estimator) with relatively low bias and variance.^4 The procedure also allows us to estimate the classifier's
discriminant variance and MSDE.

8.2.2 Estimating a Classifier's MSDE

Given K independent, randomly-selected 2-fold cross validation training samples with sizes $\{n_1, \ldots, n_K\}$
and associated test samples with sizes $\{\eta_1, \ldots, \eta_K\}$, we define the following estimators of the expectations
(defined in section 3.2) for the classifier's error rate, discriminant bias, discriminant variance, and
mean-squared discriminant error (MSDE). The notation $E(\eta_k)$ denotes the number of misclassifications made on
the kth test sample of size $\eta_k$.

^4 The reader will find a very readable overview of the extensive literature relating to classifier error rate estimation in [139, sec. 2.5].
Those seeking a more detailed treatment of this material will find it, along with extensive references to the literature, in [91, ch. 10].
Definition 8.2 Average estimated error rate: Given the classifier repeatedly generated from the
hypothesis class $G(\Theta)$ by the learning strategy $\Lambda$ over K 2-fold cross validation trials, its average
estimated error rate is simply the average of the estimated error rate in (8.1) across all trials:

$$\hat{P}_e(\mathcal{G} \mid \theta, \{\eta_1, \ldots, \eta_K\}) = \frac{1}{K} \sum_{k=1}^{K} \hat{P}_e(\mathcal{G} \mid \theta, \eta_k) = \frac{1}{K} \sum_{k=1}^{K} \frac{E(\eta_k)}{\eta_k} \tag{8.2}$$


Remark: We sometimes refer to $\hat{P}_e(\mathcal{G} \mid \theta, \{\eta_1, \ldots, \eta_K\})$ as the classifier's average (empirical) test sample
error rate. Note that (8.2) is an estimate based on the average of K 2-fold cross validation trials, whereas
(8.1) is based on a single 2-fold cross validation trial.

Definition 8.3 Estimated discriminant bias: Given the classifier repeatedly generated from the
hypothesis class $G(\Theta)$ by the learning strategy $\Lambda$ over K 2-fold cross validation trials, its estimated
discriminant bias is its average estimated error rate minus the estimated Bayes error rate $P_e(\mathcal{F}_{\text{Bayes}})$:

$$\mathrm{DBias}[\mathcal{G} \mid \{n_1, \ldots, n_K\}, G(\Theta), \Lambda] = \hat{P}_e(\mathcal{G} \mid \theta, \{\eta_1, \ldots, \eta_K\}) - P_e(\mathcal{F}_{\text{Bayes}}) \tag{8.3}$$

Remark: In general we do not know the Bayes error rate for X. The most conservative estimate is that
$P_e(\mathcal{F}_{\text{Bayes}}) = 0$ (i.e., the Bayes-optimal classifier can classify examples of X without error). Note from
the definitions of section 3.2 that a classifier's discriminant bias and MSDE are maximized (as a function
of $P_e(\mathcal{F}_{\text{Bayes}})$) when $P_e(\mathcal{F}_{\text{Bayes}}) = 0$. Thus, if we assume $P_e(\mathcal{F}_{\text{Bayes}}) = 0$, we are, if anything,
over-estimating the classifier's discriminant bias and MSDE. In the case of the DB1 digit recognition task,
humans typically recognize all 1200 examples without error, so we assume that the Bayes error rate for this
task is indeed zero.

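Definition 8.4 Estimated discriminant variance: Given the classifier repeatedly generated from the
hypothesis class $G(\Theta)$ by the learning strategy $\Lambda$ over K 2-fold cross validation trials, its estimated
discriminant variance is the "sample variance" of its estimated error rate:

$$\mathrm{DVar}[\mathcal{G} \mid \{n_1, \ldots, n_K\}, G(\Theta), \Lambda] = \frac{1}{K-1} \sum_{k=1}^{K} \left( \hat{P}_e(\mathcal{G} \mid \theta, \eta_k) - \hat{P}_e(\mathcal{G} \mid \theta, \{\eta_1, \ldots, \eta_K\}) \right)^2 \tag{8.4}$$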


Definition 8.5 Estimated mean-squared discriminant error (MSDE): Given the classifier repeatedly
generated from the hypothesis class $G(\Theta)$ by the learning strategy $\Lambda$ over K 2-fold cross validation trials,
its estimated MSDE is the sum of its estimated discriminant bias squared and its estimated discriminant
variance:

$$\mathrm{MSDE}[\mathcal{G} \mid \{n_1, \ldots, n_K\}, G(\Theta), \Lambda] = \left( \mathrm{DBias}[\mathcal{G} \mid \{n_1, \ldots, n_K\}, G(\Theta), \Lambda] \right)^2 + \mathrm{DVar}[\mathcal{G} \mid \{n_1, \ldots, n_K\}, G(\Theta), \Lambda] \tag{8.5}$$


We use these estimators to assess and compare different classifier/learning strategy combinations across a
series of 2-fold cross validation trials. We often use the term "empirical" when referring to our estimates
(e.g., the term "empirical MSDE" refers to the estimated MSDE defined above).
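
The estimators of definitions 8.2 through 8.5 can be sketched as follows. This is an illustration under stated assumptions, not the code used for the experiments: the function name `msde_estimates` and the example error counts are hypothetical, the Bayes error rate defaults to zero as assumed for this task, and the sample variance is taken with a 1/(K-1) normalizer (the conventional choice, assumed here).

```python
from statistics import mean, variance


def msde_estimates(errors, test_sizes, bayes_error_rate=0.0):
    """Estimate average error rate, DBias, DVar, and MSDE over K trials.

    errors[k] is the number of misclassifications on the kth test
    sample and test_sizes[k] is that sample's size.
    """
    rates = [e / n for e, n in zip(errors, test_sizes)]
    avg_rate = mean(rates)                 # average estimated error rate
    dbias = avg_rate - bayes_error_rate    # estimated discriminant bias
    dvar = variance(rates)                 # assumption: 1/(K-1) sample variance
    return {
        "avg_rate": avg_rate,
        "dbias": dbias,
        "dvar": dvar,
        "msde": dbias ** 2 + dvar,         # bias squared plus variance
    }


# Hypothetical two-trial example: 12 and 24 errors on 600-example tests.
result = msde_estimates([12, 24], [600, 600])
```

For these hypothetical counts the per-trial rates are 2% and 4%, so the squared bias dominates the variance; the same pattern holds whenever the average error rate is large relative to its trial-to-trial spread.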

8.2.3  Graphical  Statistical  Summaries 

We  display  single  and  multi-trial  statistics  based  on  the  estimators  described  above  using  two  simple  graphics. 
Both  graphics  illustrate  the  set  of  empirical  error  rates  obtained  over  a  series  of  2-fold  cross  validation  trials, 
and  one  can  be  used  to  characterize  the  result  of  a  single  trial. 

The first graphic is the box plot [131, ch. 2], which is described in detail in appendix C. In brief (see,
for  example,  figure  8.3  on  page  226),  the  box  of  each  plot  has  vertical  extrema  that  match  the  first  and  third 
quartiles  of  the  ranked  empirical  error  rates;  the  horizontal  line  dividing  the  box  delineates  the  median  error 
rate;  the  inner  and  (if  shown)  outer  “T”-shaped  “fences”  of  each  plot  depict  the  nominal  lower  bound  of 
the  first  quartile  and  nominal  upper  bound  of  the  fourth  quartile,  given  the  ranked  empirical  error  rates.  Any 
extreme  first/fourth  quartile  values  falling  beyond  the  outer  fence(s)  are  plotted  as  dots.  The  box  plot  therefore 
displays  the  results  of  all  trials,  emphasizing  the  median  empirical  error  rate  and  a  quartile  partitioning  of  the 
results. 
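
As a rough illustration of the quantities a box plot emphasizes, the sketch below computes the median and the first and third quartiles of a set of empirical error rates. It is only a sketch: the function name `box_summary` and the example rates are hypothetical, the quartile interpolation is whatever Python's `statistics.quantiles` uses by default (which may differ from the convention of appendix C), and the fences are deliberately not computed because their definition is given only in that appendix.

```python
from statistics import quantiles


def box_summary(error_rates):
    """Return the quartile cut points and extremes of the error rates."""
    q1, q2, q3 = quantiles(error_rates, n=4)
    return {
        "first_quartile": q1,
        "median": q2,
        "third_quartile": q3,
        "minimum": min(error_rates),
        "maximum": max(error_rates),
    }


# Hypothetical error rates from nine trials (1% through 9%).
rates = [i / 100 for i in range(1, 10)]
summary = box_summary(rates)
```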

The  second  graphic  we  use  is  the  familiar  whisker  plot.  In  the  case  of  a  single  trial  (see,  for  example, 
figure  8.8  on  page  231),  the  dot  of  the  plot  delineates  the  estimated  error  rate  of  (8.1),  and  the  upper  and 
lower  fences  represent  the  upper  and  lower  bounds  of  a  95%  confidence  interval  about  this  estimate.  The 
computation  of  this  confidence  interval  is  described  above  in  section  8.2.1.  In  the  case  of  multiple  trials  (see, 
for  example,  figure  8.3  on  page  226),  the  dot  of  the  plot  delineates  the  average  estimated  error  rate  of  (8.2), 
and  the  fences  represent  one  upper  and  one  lower  standard  deviation  (derived  from  (8.4) )  about  the  average 
estimated  error  rate. 
Figure 8.2: Parameters or weights of the logistic linear classifier after learning the DB1 database's benchmark
training sample differentially. Dark weights are negative and detect dark regions common to training
examples of the digit with which they are associated; light weights are positive and detect dark regions
common to training examples of other digits.


8.3  Compressing  the  Data  to  Improve  Generalization 

Figure  8.2  illustrates  the  parameters  (or  weights)  of  a  linear  classifier  with  10  logistic  linear  discriminant 
functions of the form described in section 7.2.2. The classifier has 2570 total parameters (257/digit)^5 and it
learns the benchmark training sample differentially. When tested on the benchmark test sample, it exhibits
a 2.7 (+1.4/-1.3)% empirical error rate.^6 Each weight display in figure 8.2 corresponds to the discriminant
function  for  the  digit  beneath  the  display.  Dark  pixels  in  the  display  represent  negative  weights,  and  light 
pixels  represent  positive  weights.  The  far-left  column  of  each  display  contains  only  one  vertically-centered 
pixel.  This  pixel  represents  the  “bias”  parameter  corresponding  to  the  unit-value  element  prepended  to  X 
in  order  to  form  the  augmented  feature  vector  of  (7.2).  The  gray  shade  of  the  far-left  pixel  column  represents 
the  value  zero  (for  reference).  A  dark  (negative)  weight  corresponds  to  a  region  that  is  typically  dark  (-1) 
in  the  training  examples  of  the  digit  with  which  the  weight's  discriminant  function  is  associated.  A  light 
(positive)  weight  corresponds  to  a  region  that  is  typically  dark  in  the  training  examples  of  any  digit  with 
which  the  weight’s  discriminant  function  is  not  associated.  For  example,  a  diagonally-skewed  dark  image  of 
the  digit  zero  is  clearly  visible  in  the  weight  display  for  the  digit  zero  discriminant  function.  Likewise,  a  dark 

^5 There are C = 10 discriminant functions, and the augmented feature vector has N + 1 = 257 elements. Therefore the classifier
has 10 · 257 = 2570 total parameters.

^6 Unless otherwise noted, error rates are given with 95% confidence intervals. These intervals are computed on the assumption that
the empirical test sample error rate is binomially distributed [62].
Figure 8.3: Left: Test sample classification summaries for the 2570-parameter logistic linear classifier
employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). The summaries are
based on 25 independent trials. In each trial, training examples are drawn randomly from the set of
1200 images with probability 1/2; those not chosen for training form the test sample. The box plots are
a non-parametric depiction of the empirical test sample error rate's distribution over the 25 trials; the
whisker plots depict the average empirical test sample error rate plus and minus one standard deviation,
thereby characterizing each classifier's MSDE. Right: The difference between the probabilistically-generated
models' empirical error rates and the differentially-generated model's rate on a trial-by-trial basis. These
box plots show that differential learning doesn't always produce the classifier with the lowest empirical error
rate; this is because the hypothesis class has excessive functional complexity for the task.


image  of  the  digit  one  is  visible  at  the  left  edge  of  the  weight  display  for  the  digit  one  discriminant  function; 
however,  a  light  image  of  the  digit  three  is  also  clearly  visible  in  the  center  of  this  weight  display.  To  a  first 
approximation, the discriminant function for the digit one therefore detects "one and not 3" images. Similar
characteristics  can  be  found  in  all  the  weight  displays,  although  the  representations  tend  to  be  quite  abstract 
to  the  human  eye. 
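
The structure described above can be sketched as follows, assuming each discriminant function is a logistic function of a weighted sum over the augmented feature vector (a unit element prepended to the 256 pixels) and that the classifier picks the digit with the largest output. The exact form is given in section 7.2.2 and is not reproduced here; the function names and the hand-set weights are hypothetical and are not learned values.

```python
import math


def logistic(activation):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-activation))


def classify(weights, pixels):
    """Evaluate one logistic linear discriminant per class.

    weights[c] holds the bias weight followed by one weight per pixel.
    Returns the index of the largest output and all outputs.
    """
    augmented = [1.0] + list(pixels)  # prepend the unit-value element
    outputs = [
        logistic(sum(w * x for w, x in zip(row, augmented)))
        for row in weights
    ]
    best = max(range(len(outputs)), key=outputs.__getitem__)
    return best, outputs


num_classes, num_pixels = 10, 256
weights = [[0.0] * (num_pixels + 1) for _ in range(num_classes)]
# Hand-set illustration: negative weights for class 3 respond to dark
# (-1) pixels, mirroring the role of dark weights in figure 8.2.
weights[3] = [0.0] + [-0.1] * num_pixels

all_black = [-1.0] * num_pixels
digit, outputs = classify(weights, all_black)
total_parameters = sum(len(row) for row in weights)
```

With these hand-set weights the all-black image drives only the class-3 discriminant toward one, while the remaining discriminants stay at 0.5 because their weights are zero.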

When generated and tested with 25 different random splits of the DB1 database, the 2570-parameter
differentially-generated logistic linear classifier exhibits a median empirical test sample error rate of 3.7%.
Probabilistically-generated variants (both MSE and Kullback-Leibler (CE) objective functions) exhibit a
slightly higher median rate of 4.2%. Figure 8.3 (left) displays the 25 trial empirical test sample error rate
statistics for the three objective functions. The results are shown in box plot [131, ch. 2] (see appendix C)
and whisker plot statistical summaries. The right-hand side of figure 8.3 compares the two probabilistic
learning strategies with the differential strategy on a trial-by-trial basis. These box plots summarize the
difference between the MSE/CE-generated classifiers' and the CFM-generated classifier's empirical
test sample error rates for each of the 25 trials. Positive values indicate that the differentially-generated
classifier exhibits a lower empirical test sample error rate than the probabilistically-generated one for the
Figure 8.4: The distribution of parameter values in the 257-parameter logistic linear discriminant function
representing the digit "3" (cf. figure 8.2). The parametric entropy of these weights is 2.39, corresponding to
the relatively low variance in the distribution. The parametric entropy of all weights in the 2570-parameter
model is 2.30.


trial;  negative  values  indicate  that  the  differentially-generated  classifier  exhibits  a  higher  empirical  test 
sample  error  rate  than  the  probabilistically-generated  one  for  the  trial.  From  figure  8.3  (left)  we  see  that 
the  differentially-generated  model's  discriminant  bias,  indicated  by  its  average  empirical  test  sample  error 
rate,  is  slightly  lower  than  the  probabilistically-generated  models’.  This  is  also  evident  in  the  trial-by-trial 
statistics  on  the  right  side  of  the  figure:  in  3/4  of  the  trials,  the  differential  model  exhibits  a  lower  error  rate 
than  its  MSE-generated  counterpart;  in  2/3  of  the  trials,  the  differential  model  exhibits  a  lower  error  rate  than 
its  CE-generated  counterpart.  The  discriminant  variance  of  the  classifiers  produced  by  differential  learning 
and  both  probabilistic  learning  procedures  is  indicated  by  the  vertical  span  of  their  respective  whisker  plots. 
Figure  8.3  (left)  indicates  that  the  differentially-generated  model’s  discriminant  variance  is  approximately 
the  same  as  the  probabilistically-generated  models’. 
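
The trial-by-trial comparison can be sketched as a paired difference: for each trial, subtract the differentially-generated classifier's error rate from the probabilistically-generated classifier's rate, then count the trials with a positive difference. The function names and the four example rates below are hypothetical and are not the experimental data.

```python
def paired_differences(probabilistic_rates, differential_rates):
    """Per-trial difference: probabilistic rate minus differential rate."""
    return [p - d for p, d in zip(probabilistic_rates, differential_rates)]


def fraction_positive(differences):
    """Fraction of trials in which the differential model has the lower rate."""
    return sum(1 for diff in differences if diff > 0) / len(differences)


# Hypothetical rates for four trials, each pair sharing one random split.
probabilistic = [0.042, 0.040, 0.045, 0.035]
differential = [0.037, 0.038, 0.040, 0.039]

differences = paired_differences(probabilistic, differential)
share = fraction_positive(differences)
```

Pairing the rates by trial removes the variation caused by the random split itself, which is why the right-hand plots can show an advantage that the pooled summaries on the left make harder to see.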

Figure 8.3 illustrates that the differentially-generated model's empirical MSDE (as indicated by the
whisker plot) is not significantly lower than the probabilistically-generated models'. This is because the
hypothesis class (i.e., the 2570-parameter logistic linear discriminator) has excess functional complexity for
the task. Figure 8.2 helps to explain why this is so. The weights of the figure look blurred because
the classifier employs weight smoothing (described in section M.2) during learning. This is done in order to
minimize the parametric entropy (definition M.1) of the classifier's weight vector, an empirical measure that
we use to gauge the weight vector's information content. Weight smoothing therefore reduces the classifier's
discriminant variance, since only the information essential to learning is retained. Figure 8.4 shows a
histogram of the weights in the discriminant function for the digit "3" (again, the weights themselves are
shown in figure 8.2). Because the classifier learns all of the 600 training examples with a large amount


Figure  8.5:  The  same  digits  shown  in  figure  8.1,  linearly  compressed  from  256-  to  64-pixel  images. 


Figure 8.6: Parameters or weights of the 650-parameter logistic linear classifier after learning the DB1
database's benchmark training sample differentially.
Figure 8.7: The distribution of parameter values in the 65-parameter logistic linear discriminant function
representing the digit "3" (cf. figure 8.6). The parametric entropy of these weights is 3.46; the parametric
entropy of all weights in the 650-parameter model is 3.18, compared with 2.30 for the 2570-parameter model.
The increased variance in these weights compared to those in figures 8.2 and 8.4 reflects the classifier's lower
functional complexity: each weight now contains more information for the discrimination task.


of weight smoothing^7 (k = 0.128), the "3" discriminant function's weight vector has relatively low
parametric  entropy  (2.39).  This  low  entropy  reflects  the  low  variance  in  the  histogram  of  the  weights.  The 
parametric  entropy  of  all  the  classifier’s  weights  is  2.3,  a  relatively  small  value  suggesting  that  256-pixel 
images  can  be  compressed  without  an  appreciable  loss  of  information.  That  is,  we  would  expect  that 
compressing  the  images  would  increase  the  parametric  entropy  of  the  resulting  lower-complexity  classifier, 
and  that  this  increase  would  not  be  so  large  as  to  cause  an  increase  in  the  classifier’s  error  rate.  Our 
expectation  is  based  on  the  notion  that  the  classifier  encodes  all  the  information  required  to  classify  all  the 
training  examples  correctly;  this  information  can  be  measured  in  terms  of  the  total  number  of  bits  necessary 
to  describe  the  classifier’s  discriminator.  As  the  number  of  parameters  in  the  discriminator  is  decreased, 
the  same  amount  of  information  must  be  encoded  with  fewer  parameters,  so  the  parametric  entropy  —  our 
measure  of  the  average  amount  of  information  in  a  single  parameter  of  the  discriminator  —  increases.  We 
remind  the  reader  that  parametric  entropy  is  an  ad-hoc  measure  of  the  information  content  in  a  retinotopic 
parameter vector (see section M.1.1). Given this measure, we hypothesize, but have not proven, that
there  is  an  upper  bound  on  the  classifier’s  information  capacity,  which  corresponds  to  an  upper  bound  on 
the  parametric  entropy  of  its  weight  vector.  Beyond  this  upper  bound,  the  classifier  fails  to  encode  all  of  the 
information  in  the  training  sample  essential  to  robust  discrimination.  Below  this  upper  bound,  the  classifier 
has  more  than  sufficient  capacity  to  encode  all  the  information  necessary  for  robust  discrimination. 

Our  belief  that  the  classifier  complexity  can  be  reduced  without  an  appreciable  information  loss  is 
validated  by  figure  8.5,  which  shows  the  images  of  figure  8.1  after  they  are  compressed  using  the  procedure 

^7 The weight smoothing parameter k has a value between zero and one (see appendix M). A value of zero results in no smoothing; a
value of one forces all weights to have the same value. From a qualitative perspective, any value of k > 0.1 is large.
described in section M.3. The images are still quite legible to the human eye, despite their having one fourth
the number of pixels. Figure 8.6 shows the weights of a 650-parameter logistic linear classifier after it learns
the compressed benchmark training sample differentially. Again, the classifier learns with a weight smoothing
coefficient of k = 0.128. Figure 8.7 shows that the "3" discriminant function's parametric entropy has
increased from 2.39 (for the 2570-parameter classifier) to 3.45, while the entropy of all discriminant functions
has increased from 2.30 to 3.18. That is, each weight in the 650-parameter model encodes more information
(as measured by the classifier's parametric entropy) than each weight in the 2570-parameter model. At the
same time, the lower-complexity classifier's empirical benchmark test sample error rate has dropped from
2.7 (+1.4/-1.3)% to 1.3 (+1.1/-0.9)%, a 52% reduction, which indicates the improved generalization of the
lower-complexity classifier.

Figure  8.8  compares  the  empirical  benchmark  test  sample  error  rates  for  the  650-parameter  logistic  linear 
classifier  employing  differential  learning  with  those  of  two  probabilistically-generated  counterparts.  The 
differentially-generated model's 1.3 (+1.1/-0.9)% error rate is approximately one half the MSE-generated
model's rate of 2.7 (+1.4/-1.3)%, and it is approximately one third the CE-generated model's rate of 4.0
(+1.7/-1.6)%. Also shown are the benchmark test sample error rates of the best independently-developed
linear classifier and the best independently-developed non-linear classifier. Both of these independent
results are described in [16]. These classifiers learn a subset of the un-compressed benchmark training
sample, which has had unrepresentative examples removed by a culling procedure described in [16]. The
independently-developed linear classifier shown learns the culled training sample using a discriminative
learning procedure also described in [16]; it exhibits an empirical benchmark test sample error rate of 3.2
(+1.6/-1.4)%. The independently-developed non-linear classifier shown learns the culled training sample
after all its examples have been heavily filtered using a Gaussian smoothing kernel; it exhibits an error rate
of 0.3 (+0.6/-0.3)%.


8.4  Recognition  Results 


Figure  8.8  shows  that  the  differentially-generated  logistic  linear  classifier  makes  fewer  recognition  errors 
on  the  benchmark  test  sample  than  all  but  the  best  independently-developed  non-linear  classifier.  Based 
on  this  single  trial,  the  differentially-generated  linear  model  is  not  significantly  better  than  the  other  linear 
models,  nor  is  it  significantly  worse  than  the  non-linear  model.  Since  the  independent  results  are  based  on  a 
single  trial  using  the  benchmark  test  sample,  the  only  multi-trial  comparisons  we  can  make  are  with  our  own 
probabilistically-generated models.^8


^8 Geman et al. have run multiple independent learning/testing trials using random data splits [41], but the error rates of their
probabilistically-generated classifiers are considerably higher than those of our probabilistic controls. We therefore restrict our
multi-trial comparisons to our own experiments in order to give probabilistic learning a fair evaluation.




[Figure 8.8 plot: whisker plots of test sample error rate (vertical axis, 0% to 6%) for $\Lambda_P$(CE), $\Lambda_P$(MSE),
$\Lambda_\Delta$(CFM), the best independent linear result, and the best independent non-linear result.]

Figure 8.8: Test sample empirical error rates with 95% confidence intervals for the DB1 database's
benchmark split of training/testing examples. The differentially-generated logistic linear classifier ($\Lambda_\Delta$) is
shown with two probabilistically-generated controls ($\Lambda_P$), the best independent linear result [16], and the
best independent non-linear result [16].


Figure  8.9:  Left:  Test  sample  classification  summaries  for  the  650-parameter  logistic  linear  classifier 
employing differential learning (A_Δ) and two forms of probabilistic learning (A_p). The summaries are
based  on  25  independent  trials  in  which  the  DBI  database  is  randomly  partitioned  into  training  and  test 
samples,  each  containing  approximately  600  examples.  The  box  plots  are  a  non-parametric  depiction  of 
the  empirical  test  sample  error  rate’s  distribution  over  the  25  trials;  the  whisker  plots  depict  the  average 
empirical  test  sample  error  rate  plus  and  minus  one  standard  deviation,  thereby  characterizing  each  classifier's 
MSDE. Right: The difference between the probabilistically-generated models' empirical error rate and the
differentially-generated  model’s  rate  on  a  trial-by-trial  basis.  An  increase  of  2%  represents  a  doubling  of 
the  differentially-generated  classifier’s  median  empirical  error  rate.  These  box  plots  show  that  differential 
learning  always  produces  the  classifier  with  the  lowest  empirical  error  rate.  Moreover,  the  lower-complexity 
differentially-generated  logistic  linear  classifier  generalizes  better  than  all  of  the  higher-complexity  classifiers 
in  figure  8.3. 
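
The trial-by-trial comparison in the right-hand panel can be sketched as follows. This is a minimal illustration only: the function name and the five pairs of error rates are hypothetical placeholders, not data from the 25 trials reported here.

```python
from statistics import median


def paired_differences(control_rates, reference_rates):
    """Per-trial difference: control error rate minus reference error rate."""
    if len(control_rates) != len(reference_rates):
        raise ValueError("both lists must cover the same trials")
    return [c - r for c, r in zip(control_rates, reference_rates)]


# Hypothetical error rates (as fractions) for five random splits.
# These numbers are placeholders and do NOT come from the report.
cfm_rates = [0.020, 0.018, 0.025, 0.022, 0.019]
mse_rates = [0.033, 0.030, 0.036, 0.037, 0.031]

diffs = paired_differences(mse_rates, cfm_rates)

# True when the reference (differential) classifier wins every trial.
always_lower = all(d > 0 for d in diffs)

# The median difference is the centre line of a box plot of `diffs`.
typical_gap = median(diffs)

# The median per-trial ratio expresses the same gap multiplicatively.
typical_ratio = median(m / c for m, c in zip(mse_rates, cfm_rates))
```

Pairing the rates by trial removes the split-to-split variation that both classifiers share, which is why the difference plot can show a consistent gap even when the two box plots on the left overlap.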
Estimated MSDE (25 trials)

                               Learning Strategy
Classifier        A_Δ (CFM)      A_p (MSE)      A_p (CE)
2570-Parameter    1.6 x 10^-3    1.9 x 10^-3    1.9 x 10^-3
650-Parameter     4.7 x 10^-4    1.2 x 10^-3    1.8 x 10^-3

Table 8.1: Estimated MSDE for the high and low-complexity logistic linear classifiers employing differential
learning (A_Δ) via the CFM objective function and probabilistic learning (A_p) via the MSE and CE objective
functions. Estimates are based on 25 independent learning/testing trials. Reducing the classifier's complexity
by compressing the digit images has the beneficial effect of reducing the classifier's estimated MSDE. The
reduction is most pronounced in the differentially-generated model, as predicted by theory.


8.4.1  Experiments  with  the  Logistic  Linear  Hypothesis  Class 

Figure 8.9 compares the 650-parameter logistic linear classifier employing differential learning with
controls that employ probabilistic learning (MSE and CE objective functions). These comparisons are
for the same 25 random splits of the DBI database used for the 2570-parameter classifier tests shown in
figure 8.3. All the low-complexity models exhibit lower empirical MSDE than their higher-complexity
counterparts, as indicated by the lower average error rates and slightly reduced whisker plot spans of
figure 8.9 (left) versus figure 8.3 (left). Table 8.1 summarizes the estimated MSDE for the high and
low complexity classifiers, given the three learning strategies. The differentially-generated model exhibits
the largest reduction in MSDE: its average empirical test sample error rate drops from 3.9% to 2.1%,
while the standard deviation of this statistic drops from 0.71% to 0.57%. Assuming a Bayes error rate
of zero for the DBI database, the 2570-parameter differentially-generated model's empirical MSDE is, by
(3.9), 1.6 x 10^-3; the 650-parameter differentially-generated model's empirical MSDE is 4.7 x 10^-4,
approximately one fourth that of the higher-complexity model. Thus, reducing the classifier's complexity
by compressing the image feature vector by a factor of 4:1 reduces the differentially-generated model's
MSDE by approximately the same ratio. For probabilistic learning via MSE, the higher-complexity model's
MSDE is 1.9 x 10^-3, and the lower-complexity model's is 1.2 x 10^-3. For probabilistic learning via the
Kullback-Leibler information distance (CE), the higher-complexity model's MSDE is 1.9 x 10^-3, and the
lower-complexity model's is virtually unchanged at 1.8 x 10^-3. Figure 8.9 (right) also shows that the
reduced-complexity differentially-generated classifier consistently exhibits a lower empirical test sample
error rate than its probabilistic counterparts. The MSE-generated classifier's error rate is typically 1.3%
greater than (or 1.65 times) the CFM-generated classifier's. The CE-generated classifier's error rate is
typically 2.0% greater than (or two times) the CFM-generated classifier's. Thus, reducing the classifier's
complexity reduces the MSDE of all the classifiers, but the reduction realized by the differentially-generated
model is significantly greater.

Figure 8.10: The empirical error rates (training sample in gray and test sample in black) for the 650-parameter
logistic linear classifier as it learns the benchmark training sample differentially. The classifier's empirical
test sample error rate is 1.3 (+1.1/-0.9)% after 157 learning epochs.
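
The MSDE values quoted in section 8.4.1 can be checked with a short calculation. Equation (3.9) is not reproduced in this chapter, so the sketch below assumes the usual decomposition (squared bias plus variance, with the bias measured against the Bayes error rate); the function name `msde` is ours. Under that assumption the quoted means and standard deviations reproduce the quoted MSDE values to two significant figures.

```python
def msde(mean_error, std_error, bayes_error=0.0):
    """Squared bias plus variance of the empirical error rate.

    Assumption: this is the form of equation (3.9); the report's own
    definition is not restated in this section.
    """
    bias = mean_error - bayes_error
    return bias ** 2 + std_error ** 2


# Differentially-generated logistic linear classifiers, 25 trials,
# using the mean and standard deviation quoted in the text.
msde_2570 = msde(mean_error=0.039, std_error=0.0071)
msde_650 = msde(mean_error=0.021, std_error=0.0057)

# Fraction of the higher-complexity model's MSDE that remains
# after the 4:1 compression of the feature vector.
remaining_fraction = msde_650 / msde_2570
```

Because the squared bias dominates both values, the reduction in MSDE is driven almost entirely by the drop in the average error rate rather than by the smaller standard deviation.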

Comparing Learning Strategies for the Benchmark Training/Test Sample

In simple terms, the reduced-complexity differentially-generated model realizes the biggest reduction in
MSDE because (as proven in chapter 3) differential learning 1) is asymptotically efficient, regardless of the
choice of hypothesis class, and 2) requires the minimum-complexity hypothesis class necessary for Bayesian
discrimination. Figures 8.10 through 8.15 demonstrate these characteristics for the logistic linear classifier
learning the benchmark training sample. Figure 8.10 shows both the training sample (gray) and test sample
(black) empirical error rates as differential learning progresses through approximately 160 learning epochs.
The objective function's value is plotted as a light gray background in the figure. Ninety-five percent confidence
intervals on the error rates are plotted at periodic intervals. From these one can see that the training sample
error rate is representative of the test sample error rate up to ninety differential learning epochs. Beyond this
point the empirical training sample error rate is significantly lower than the test sample error rate. During
differential learning, [symbol illegible] is reduced from a value of 0.48 at epoch zero to 0.35 beyond epoch 100.
The final output state of the logistic linear classifier is shown on reduced discriminator output space
(definition 5.2) in figure 8.11. Test examples are shown as black triangles, and training examples are shown as
gray dots. After approximately 160 epochs, all the training examples lie parallel to the CFM = 0.90 contour, as
do most of the test examples. The remaining test examples (ones that are hard for the classifier to discriminate)
fall close to the reduced discriminant boundary (definition 5.5). Owing to the monotonic nature of the CFM
objective function, these examples are also parallel to the reduced discriminant boundary, and most of them
are on the correct side of the boundary.

Figure 8.11: The 650-parameter logistic linear classifier's output state, as projected onto reduced
discriminator output space, after learning the 600 benchmark training examples differentially. This output
state corresponds to the parameters shown in figure 8.6. Note how most of the test examples (black triangles)
and all of the training examples (gray dots underneath the test examples) are aligned with the contours of
constant CFM on reduced discriminator output space. These constant CFM contours are parallel to the
reduced discriminant boundary (definition 5.5), a necessary condition for efficient learning.

[Plot: Training History of Error Statistics (MSE)]

Figure 8.12: The empirical error rates (training sample in gray and test sample in black) for the 650-parameter
logistic linear classifier as it learns the benchmark training sample probabilistically (MSE objective function).
The classifier's empirical test sample error rate is 2.7 (+1.4/-1.3)% after 159 learning epochs.

[Plot: Reduced Discriminant Continuum (MSE)]

Figure 8.13: The 650-parameter logistic linear classifier's output state, as projected onto reduced
discriminator output space, after it attempts to learn the 600 benchmark training examples probabilistically.
Note that the MSE-generated classifier cannot learn all the training examples, given the low-complexity
logistic linear hypothesis class.

Figure 8.12 shows the empirical training and test sample error rates for the logistic linear classifier that
learns the benchmark training sample probabilistically via the MSE objective function. Its empirical training
sample error rate remains representative of the test sample error rate through all 160 epochs, although both
error rates are higher than those for the differentially-generated model. Unfortunately, the 650-parameter
classifier that learns probabilistically with a weight smoothing coefficient of k = .128 has insufficient
functional complexity to learn the training sample as well as its differentially-generated counterpart (cf.
figure 8.13 versus figure 8.11). As a result, the non-monotonic nature of the MSE objective function
leads the classifier to learn the majority of easy examples with high confidence (i.e., to minimize the MSE
between its outputs and the easy examples' binary target vectors), while it fails to learn the minority of
hard examples. This phenomenon is clearly depicted in figure 8.13. The MSE-generated classifier's training
and test examples are aligned with constant contours of MSE; the harder the example, the higher the MSE.
However, the contours of constant MSE are not parallel to the reduced discriminant boundary (i.e., MSE is
not a monotonic objective function; see definition 5.10). As a result, a larger proportion of hard examples fall
on the incorrect side of the boundary, and the classifier exhibits higher empirical training and test sample
error rates than its differentially-generated counterpart (see figure 8.8, page 231).

[Plot: Training History of Error Statistics (CE)]

Figure 8.14: The empirical error rates (training sample in gray and test sample in black) for the 650-parameter
logistic linear classifier as it learns the benchmark training sample probabilistically (CE objective function).
The classifier's empirical test sample error rate is 4.0 (+1.7/-1.6)% after 159 learning epochs.
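
The remark that contours of constant MSE are not parallel to the reduced discriminant boundary can be illustrated with a two-output example. This is a simplified sketch under our own assumptions: two classes, a sum-of-squares error against a binary target vector, and a discriminant differential defined as the correct-class output minus the largest competing output. It is not the construction of definitions 5.2, 5.5, or 5.10, which are not restated in this section.

```python
def squared_error(outputs, targets):
    """Sum-of-squares error between classifier outputs and binary targets."""
    return sum((y - t) ** 2 for y, t in zip(outputs, targets))


def differential(outputs, correct_index):
    """Correct-class output minus the largest competing output."""
    rival = max(y for i, y in enumerate(outputs) if i != correct_index)
    return outputs[correct_index] - rival


target = (1.0, 0.0)      # binary target vector: class 0 is correct

# Two hypothetical output states at the same distance from the
# boundary y_0 = y_1 (identical differentials) ...
state_a = (1.00, 0.50)
state_b = (0.75, 0.25)

diff_a = differential(state_a, correct_index=0)
diff_b = differential(state_b, correct_index=0)

# ... that nevertheless lie on different contours of constant error.
error_a = squared_error(state_a, target)
error_b = squared_error(state_b, target)
```

Both states are classified equally well in the sense that matters for discrimination, yet the squared error treats one as twice as bad as the other, so reducing it does not necessarily move examples further onto the correct side of the boundary.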

The classifier that employs probabilistic learning via the Kullback-Leibler information distance (CE
objective function) exhibits the same inefficient behavior that the MSE-generated classifier exhibits.
Figure 8.14 shows that the CE-generated classifier's empirical training sample error rate remains representative
of the test sample error rate through all 160 epochs, although both error rates are higher than those for the
differentially-generated model. Again, the 650-parameter classifier that learns probabilistically with a weight
smoothing coefficient of k = .128 has insufficient functional complexity to learn the training sample as
well as its differentially-generated counterpart. As a result, the CE objective function induces the classifier to
learn the easy examples with high confidence while it fails to learn the minority of hard examples. Since the
contours of constant CE have even more curvature than those of the MSE objective function (i.e., CE is even
less monotonic than the MSE objective function; cf. figure 8.15 versus figure 8.13), a proportionally greater
number of hard examples fall on the incorrect side of the reduced discriminant boundary. This is reflected
in the CE-generated classifier's elevated empirical training and test sample error rates (again, see figure 8.8,
page 231).

[Plot: Reduced Discriminant Continuum (CE)]

Figure 8.15: The 650-parameter logistic linear classifier's output state, as projected onto reduced
discriminator output space, after it attempts to learn the 600 benchmark training examples probabilistically.
Note that the Kullback-Leibler (CE) generated classifier cannot learn all the training examples, given the
low-complexity logistic linear hypothesis class.

8.4.2 Experiments with Alternative Hypothesis Classes

The  inefficiency  of  probabilistic  learning  impacts  the  classifier's  MSDE  to  a  degree  that  varies  with  the 
choice  of  hypothesis  class.  If  the  hypothesis  class  is  a  reasonably  good  approximation  to  a  proper  parametric 
model of the feature vector (definition 3.13, page 63), the MSDE of classifiers produced from the hypothesis
class  by  probabilistic  learning  might  be  lower  than  the  MSDE  of  their  differentially-generated  counterparts 
for  small  training  sample  sizes.  However,  if  the  hypothesis  class  is  a  distinctly  improper  parametric  model 
of  the  feature  vector,  the  MSDE  of  the  probabilistically-generated  classifier  will  be  substantially  higher  than 
that  of  the  differentially-generated  classifier  (recall  that  chapter  4  illustrates  this  phenomenon  for  a  simulated 
random  feature  variable). 

The  Linear  Hypothesis  Class 

Figures 8.16 through 8.20 provide a detailed comparison of probabilistic and differential learning when the
classifier is generated from a 650-parameter linear hypothesis class of the form described in section 7.2.1.
Learning with the linear hypothesis class proceeds faster than it does with the logistic linear hypothesis class
for the reasons outlined in section 5.5.1 (all learning parameters for the linear hypothesis class are identical to
those for the logistic linear hypothesis class). This is evident from the history of the benchmark empirical
training and test sample error rates shown in figure 8.16. As with the logistic linear hypothesis class, the training
sample error rate is significantly lower than the test sample error rate beyond a certain point in the differential
learning trial (40 epochs in this case). The linear classifier's output state after 75 differential learning epochs
is shown in figure 8.17. Many training and test examples fall outside the unit square on reduced discriminator
output space because the linear classifier's outputs are not bounded. That is, each output can take any real
value rather than being confined to (0, 1).

Nevertheless, we see the same general trends displayed by the logistic linear classifier in figure 8.11: All
of the training examples are learned, since they engender discriminant differentials that are greater than
δ_learned (0.2). Most of the test examples also exhibit relatively large positive discriminant differentials.
The harder test examples with negative or relatively small positive discriminant differentials are parallel to
the reduced discriminant boundary. The differentially-generated linear classifier's benchmark empirical test
sample error rate is 2.3 (+1.4/-1.3)%, not significantly higher than the differentially-generated logistic
linear classifier's rate.

Figure 8.16: The empirical error rates (training sample in gray and test sample in black) for the 650-parameter
linear classifier as it learns the benchmark training sample differentially. Learning is predictably faster, albeit
somewhat less stable than it is with the logistic linear hypothesis class. The classifier's empirical test sample
error rate is 2.3 (+1.4/-1.2)% after 75 learning epochs. The critical reader should note that the empirical test
sample error rate settles at this value beyond 75 learning epochs.

Figure 8.17: The 650-parameter linear classifier's output state, as projected onto reduced discriminator
output space, after learning the 600 benchmark training examples differentially. Since the linear classifier's
outputs are not bounded on (0, 1), significant fractions of the training and test samples engender output states
that fall outside the unit hypercube. Such examples therefore fall outside the unit square in this figure.

Figure 8.18: The empirical error rates (training sample in gray and test sample in black) for the 650-parameter
linear classifier as it learns the benchmark training sample probabilistically (MSE objective function). The
classifier's empirical test sample error rate is 5.0 (+1.9/-1.8)% after 60 learning epochs. No further learning
occurs beyond 60 epochs.

Figure 8.18 shows the benchmark empirical training and test sample error rates of the linear classifier
during probabilistic learning via MSE. The training sample error rate remains representative of the test
sample error rate throughout the learning trial. Indeed, the error rates converge to their final values after
approximately 35 epochs. The final empirical test sample error rate is 5.0 (+1.9/-1.8)%, more than
twice the differentially-generated model's rate. Being an improper parametric model of X and having
insufficient functional complexity to model the empirical a posteriori class probabilities of X accurately, the
probabilistically-generated linear model minimizes the MSE between its output state and the training sample
target vectors as best it can. Figure 8.19 shows the classifier's post-learning output state as projected onto
reduced discriminator output space. Note that both training and test examples are aligned with the contours of
constant MSE. This is particularly clear for the misclassified examples, which fall on the incorrect side of the
reduced discriminant boundary. All of the training examples and most of the test examples that fall inside the
MSE = 0.36 contour are learned or correctly classified by the differentially-generated model in figure 8.17.
Figure 8.20 shows that the probabilistically-generated linear classifier's error rate is consistently higher than
the differentially-generated model's across the 25 random splits of the DBI database. Indeed, the MSE-generated
model's error rate is typically more than three times the CFM-generated model's. Experiments
with the CE objective function are not possible because the linear classifier's outputs are unbounded; this
violates the conditions necessary for learning via the CE objective function (see section 2.3.2).
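
A short sketch shows why unbounded outputs rule out the CE objective function. It assumes the familiar binary cross-entropy form for a single output; section 2.3.2 is not reproduced here, so the exact expression used in this report may differ. The helper names are ours.

```python
import math


def cross_entropy(output, target):
    """Binary cross-entropy of one output against a target in {0, 1}.

    Assumed form; requires 0 < output < 1 because both log(output)
    and log(1 - output) must exist.
    """
    return -(target * math.log(output)
             + (1.0 - target) * math.log(1.0 - output))


def ce_is_defined(output, target):
    """Report whether the cross-entropy can be evaluated at this output."""
    try:
        cross_entropy(output, target)
        return True
    except ValueError:      # math.log raises ValueError for arguments <= 0
        return False


bounded_ok = ce_is_defined(0.7, 1.0)        # a logistic-style output
overshoot_ok = ce_is_defined(1.3, 1.0)      # a linear output above 1
undershoot_ok = ce_is_defined(-0.2, 0.0)    # a linear output below 0
```

A logistic output always stays inside the open unit interval, so the logarithms exist; a linear output can leave that interval, at which point the objective has no value to minimize.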

The  Modified  Gaussian  RBF  Hypothesis  Class 

Figure 8.21 summarizes the results of 25 learning trials, given a 650-parameter modified Gaussian RBF
hypothesis class of the form described in appendix K. Results are summarized for differential learning
via the CFM objective function and probabilistic learning via the MSE and CE objective functions. These
comparisons are for the same 25 random splits of the DBI database used for all the previous multi-trial
experiments. The modified RBF hypothesis class is an improper parametric model of X. In addition, the
hypothesis class cannot form the same piece-wise linear boundaries on X that the linear hypothesis classes
can.^ As a result, the modified RBF hypothesis class has insufficient functional complexity to match the
linear hypothesis class error rates. This is evident in the 3.8% median error rate of the differentially-generated
classifier, which is approximately twice the differentially-generated linear classifiers' median empirical error
rates. The probabilistically-generated RBF classifiers fare worse in comparison to their linear counterparts;
their median empirical error rates increase to 12% (MSE) and 10% (CE). The probabilistically-generated
RBF classifiers consistently exhibit empirical error rates that are between two and four times the median rate
for the differentially-generated classifier.

^A formal proof of this assertion would require a number of pages, and would not lend anything of substance to our line of argument.
Therefore, we ask the reader to accept this assertion on faith.

[Plot: Reduced Discriminant Continuum (MSE)]

Figure 8.19: The 650-parameter linear classifier's output state, as projected onto the reduced discriminator
output space, after it attempts to learn the 600 benchmark training examples probabilistically (MSE
objective function). Note that the MSE-generated classifier cannot learn all of the training examples, given
the low-complexity linear hypothesis class.

8.4.3 Interpretation of Results

Figure 8.20: Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning (A_Δ) and the MSE form of probabilistic learning (A_p). The summaries are based on
25 independent trials in which the DBI database is randomly partitioned into training and test samples, each
containing approximately 600 examples. The box plots are a non-parametric depiction of the empirical test
sample error rate's distribution over the 25 trials; the whisker plots depict the average empirical test sample
error rate plus and minus one standard deviation, thereby characterizing each classifier's MSDE. Right:
The difference between the probabilistically-generated models' empirical error rate and the
differentially-generated model's rate on a trial-by-trial basis. This box plot shows that differential learning
always produces the classifier with the lowest empirical error rate. Probabilistic learning engenders linear
classifiers with error rates that are typically more than three times those of the differentially-generated
linear classifier.

Figure 8.21: Left: Test sample classification summaries for the 650-parameter modified RBF classifier
employing differential learning (A_Δ) and two forms of probabilistic learning (A_p). The summaries are
based on 25 independent trials in which the DBI database is randomly partitioned into training and test
samples, each containing approximately 600 examples. The box plots are a non-parametric depiction of
the empirical test sample error rate's distribution over the 25 trials; the whisker plots depict the average
empirical test sample error rate plus and minus one standard deviation, thereby characterizing each classifier's
MSDE. Right: The difference between the probabilistically-generated models' empirical error rate and the
differentially-generated model's rate on a trial-by-trial basis. These box plots show that differential learning
always produces the classifier with the lowest empirical error rate, as is the case with both 650-parameter
linear hypothesis classes. Probabilistic learning engenders RBF classifiers with error rates that are typically
two and a half to three times those of the differentially-generated RBF classifier.

Estimated DBias, DVar, and MSDE (25 trials)

Hypothesis                                Learning Strategy
Class                       A_Δ (CFM)      A_p (MSE)       A_p (CE)
Linear             DBias    2.1 x 10^-2    6.3 x 10^-2     N/A
                   DVar     6.3 x 10^-5    1.2 x 10^-4     N/A
                   MSDE     4.9 x 10^-4    4.1 x 10^-3     N/A
Logistic Linear    DBias    2.1 x 10^-2    3.4 x 10^-2     [not legible in source]
                   DVar     3.2 x 10^-5    5.6 x 10^-5     8.7 x 10^-5
                   MSDE     4.7 x 10^-4    1.2 x 10^-3     1.8 x 10^-3
Modified RBF       DBias    4.0 x 10^-2    11.9 x 10^-2    10.1 x 10^-2
                   DVar     1.1 x 10^-4    2.7 x 10^-4     1.3 x 10^-4
                   MSDE     1.7 x 10^-3    1.4 x 10^-2     1.0 x 10^-2

Table 8.2: Estimated discriminant bias, discriminant variance, and MSDE for 650-parameter classifiers
generated from the linear, logistic linear, and modified RBF hypothesis classes by differential learning (A_Δ)
via the CFM objective function and probabilistic learning (A_p) via the MSE and CE objective functions.
Estimates are based on 25 independent learning/testing trials in which the DBI database is randomly
partitioned into training and test samples, each containing approximately 600 examples. The differentially
generated classifier's MSDE is O[1/10] that of its probabilistically generated counterparts' for all three
hypothesis classes.


Table  8.2  summarizes  the  estimated  discriminant  bias,  discriminant  variance,  and  MSDE  of  the  classifiers 
generated  from  the  linear,  logistic  linear,  and  modified  RBF  hypothesis  classes.  Results  are  given  for  each 
learning  strategy,  as  appropriate.  All  values  are  based  on  the  assumption  that  the  Bayes  error  rate  for 
the  DBI  task  is  zero,  given  the  compressed  images  (see  section  8.2).  The  MSE-generated  logistic  linear 
classifier's  empirical  MSDE  is  2.5  times  the  differentially-generated  classifier’s;  the  CE-generated  logistic 
linear  classifier’s  empirical  MSDE  is  4  times  the  differentially-generated  classifier’s.  Most  of  the  increase  is 
due  to  increased  discriminant  bias.  For  the  linear  and  modified  RBF  hypothesis  classes,  the  probabilistically- 
generated  classifiers’  MSDE  is  an  order  of  magnitude  higher  than  the  differentially-generated  classifier’s 
(recall  from  chapter  3  that  a  classifier  with  lower  MSDE  constitutes  a  better  approximation  to  the  Bayes- 
optimal  classifier).  Although  most  of  the  MSDE  increase  is  due  to  increased  discriminant  bias,  an  appreciable 
fraction  of  it  is  due  to  the  increased  discriminant  variance  of  these  hypothesis  classes  when  paired  with 
probabilistic  learning. 
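
The statement that most of the MSDE increase is due to discriminant bias can be checked against the logistic linear rows of table 8.2. The sketch below assumes, as the tabulated values suggest, that MSDE is the squared discriminant bias plus the discriminant variance; the function name is ours, and the inputs are the rounded two-figure table entries, so the result is approximate.

```python
def bias_share_of_increase(bias_ref, msde_ref, bias_ctl, msde_ctl):
    """Fraction of the MSDE increase attributable to the squared-bias term.

    Assumes MSDE = DBias**2 + DVar for both classifiers.
    """
    bias_term_increase = bias_ctl ** 2 - bias_ref ** 2
    return bias_term_increase / (msde_ctl - msde_ref)


# Logistic linear hypothesis class: CFM (reference) versus MSE (control).
share = bias_share_of_increase(
    bias_ref=2.1e-2, msde_ref=4.7e-4,
    bias_ctl=3.4e-2, msde_ctl=1.2e-3,
)

# Ratio quoted in the text as "2.5 times".
msde_ratio = 1.2e-3 / 4.7e-4
```

With these inputs nearly all of the increase comes from the bias term, which agrees with the qualitative statement in the text for the logistic linear case.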

These  findings  are  consistent  with  the  theoretical  predictions  of  chapter  3.  The  non-monotonic  nature  of 
error  measures  clearly  plays  a  role  in  the  inefficient  behavior  of  probabilistic  learning  strategies  and  explains 
in geometric terms why minimizing the classifier's functional error does not minimize its discriminant error
—  proofs  of  which  are  given  in  section  3.4.  The  chapter  3  proof  that  differential  learning  produces  the 
relatively  efficient  classifier,  regardless  of  the  choice  of  hypothesis  class,  is  clearly  demonstrated  by  the 
statistics  in  table  8.2. 


8.4.4  Rejecting  Classifications  After  Learning 

Table  8.3  reviews  the  empirical  error  rates  for  the  benchmark  test  sample.  Rates  are  given  for  classifiers 
generated from the linear, logistic linear, and modified Gaussian RBF hypothesis classes with differential
learning via CFM and probabilistic learning via MSE and CE. The logistic linear classifiers' error rates
correspond to figures 8.11 (CFM), 8.13 (MSE), and 8.15 (CE). The linear classifiers' error rates correspond
to  figures  8.17  (CFM)  and  8.19  (MSE).  The  modified  Gaussian  RBF  classifiers’  error  rates  correspond  to  the 
text  of  section  8.4.2. 

Table 8.4 shows the results of rejecting marginal classifications (i.e., those close to the reduced
discriminant boundary) for each of these classifiers. As described in section 7.6, the classifier rejects all
test examples that generate a top-ranked discriminant differential δ_(1)(X|θ) that is less than the default
rejection threshold δ_reject:

$$\text{reject classification iff} \quad \underbrace{y_{(1)} - y_{(2)}}_{\delta_{(1)}(X \mid \theta)} \; < \; \delta_{\mathrm{reject}} = \tfrac{1}{2}\,[\text{factor illegible}] \qquad (8.6)$$

Note that [symbol illegible] is shown diagrammatically in figure D.1. We use this differential rejection threshold
for the probabilistically-generated classifiers as well as the differentially-generated ones, since it yields better
results than MSE or CE-based thresholds (recall section 7.6).
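
The rejection rule of (8.6) and the two statistics reported in table 8.4 can be sketched as follows. The function names, the four output vectors, their labels, and the threshold value of 0.3 are all hypothetical; the default threshold used in this report is not legible in this copy, so 0.3 is only a stand-in.

```python
def top_ranked_differential(outputs):
    """Largest output minus the second-largest output."""
    ranked = sorted(outputs, reverse=True)
    return ranked[0] - ranked[1]


def classify_with_reject(outputs, threshold):
    """Return the winning class index, or None if the example is rejected."""
    if top_ranked_differential(outputs) < threshold:
        return None
    return max(range(len(outputs)), key=outputs.__getitem__)


def rejection_summary(all_outputs, labels, threshold):
    """Fraction rejected, and fraction of un-rejected examples misclassified."""
    decisions = [classify_with_reject(o, threshold) for o in all_outputs]
    kept = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    rejected_fraction = 1.0 - len(kept) / len(decisions)
    error_fraction = sum(d != y for d, y in kept) / len(kept) if kept else 0.0
    return rejected_fraction, error_fraction


# Hypothetical classifier outputs and true labels (not data from the report).
sample_outputs = [
    [0.90, 0.10, 0.05],   # confident and correct
    [0.55, 0.50, 0.10],   # marginal: rejected
    [0.20, 0.80, 0.10],   # confident and correct
    [0.70, 0.10, 0.10],   # confident but wrong (true class is 2)
]
sample_labels = [0, 0, 1, 2]

rejected_fraction, error_fraction = rejection_summary(
    sample_outputs, sample_labels, threshold=0.3
)
```

The same rule applies unchanged to outputs from any of the three learning strategies, since it depends only on the ranking and spacing of the outputs and not on the objective function that produced them.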

The differentially generated classifiers consistently reject no more than 6% of the test sample and exhibit
no more than a 1% error rate on the remaining (i.e., un-rejected) test examples. The consistency of these
rejection/error rates does not hold for the probabilistically-generated classifiers. Both probabilistically-generated
logistic linear classifiers reject no more than 6% of the test sample and exhibit no more than
a 2% error rate on the remaining test examples, but the linear and modified Gaussian RBF classifiers'
rejection/error statistics are considerably worse. The MSE-generated linear classifier rejects approximately
9% of the test sample, and misclassifies approximately 2% of the remaining test examples. The MSE-generated
RBF classifier rejects approximately 15% of the test sample, and misclassifies approximately 5%
of the remaining test examples. The CE-generated RBF classifier rejects approximately 16% of the test
sample, and misclassifies approximately 3% of the remaining test examples.

Benchmark Empirical Error Rates

Hypothesis                            Learning Strategy
Class              A_Δ (CFM)           A_p (MSE)            A_p (CE)
Linear             2.3 (+1.4/-1.2)%    5.0 (+1.9/-1.8)%     N/A
Logistic Linear    1.3 (+1.1/-0.9)%    2.7 (+1.4/-1.3)%     4.0 (+1.7/-1.5)%
Modified RBF       2.0 (+1.3/-1.1)%    11.5 (+2.7/-2.6)%    10.3 (+2.6/-2.4)%

Table  8.3:  Empirical  benchmark  test  sample  error  rates  for  650-parameter  classifiers  produced  from  the 
linear,  logistic  linear,  and  modified  RBF  hypothesis  classes  by  differential  learning  ( )  via  the  CFM 
objective  function  and  probabilistic  learning  ( Ap )  via  the  MSE  and  CE  objective  functions.  Ninety  five 
percent  confidence  intervals  are  based  on  the  assumption  that  the  error  rates  are  binomially  distributed  [62]. 


Benchmark Rejection Rates / Error Rates

Hypothesis Class     Statistic                                            $\Lambda_\Delta$ (CFM)     $\Lambda_P$ (MSE)       $\Lambda_P$ (CE)
Linear               Fraction of test sample rejected                     4.2 (+1.7/-1.6)%           8.7 (+2.6/-2.3)%        N/A
                     Fraction of un-rejected test sample misclassified    0.9 (+0.9/-0.8)%           1.8 (+1.3/-1.1)%        N/A
Logistic Linear      Fraction of test sample rejected                     5.8 (+2.1/-1.9)%           4.8 (+1.9/-1.7)%        5.5 (+2.0/-1.8)%
                     Fraction of un-rejected test sample misclassified    0.4 (+0.6/-0.4)%           1.2 (+1.1/-0.9)%        1.8 (+1.2/-1.1)%
Modified RBF         Fraction of test sample rejected                     5.8 (+2.1/-1.9)%           15.2 (+3.0/-2.9)%       16.5 (+3.1/-3.0)%
                     Fraction of un-rejected test sample misclassified    0.4 (+0.6/-0.4)%           5.3 (+2.1/-2.0)%        3.0 (+1.7/-1.5)%

Table 8.4: Benchmark test sample rejection/empirical error rate statistics for 650-parameter classifiers produced from the linear, logistic linear, and modified RBF hypothesis classes by differential learning ($\Lambda_\Delta$) via the CFM objective function and probabilistic learning ($\Lambda_P$) via the MSE and CE objective functions. Ninety five percent confidence intervals are based on the assumption that the error rates are binomially distributed [62].

As in section 7.6, we see that the inefficiency of probabilistic learning has a deleterious effect on the process by which the classification hypothesis is accepted or rejected. Moreover, the rejection/error statistics of probabilistically-generated classifiers are quite sensitive to the choice of hypothesis class. This phenomenon is consistent with the theoretical arguments of section 3.4, in which we make a clear distinction between minimizing discriminant error via differential learning and minimizing functional error via probabilistic learning. The efficiency of differential learning ensures consistent rejection/error statistics across a wide range of hypothesis classes.

8.5  Recognition  Results  in  the  Presence  of  Noise 

As described in section 3.6, there are special cases in which probabilistic learning generates the efficient classifier of X for small training sample sizes, whereas differential learning does so only for large training sample sizes. Specifically, when the hypothesis class is a proper parametric model of X, probabilistic learning will generate the efficient classifier of X.

This is illustrated in the case of the DB1 OCR task when the original 256-pixel binary images are corrupted by noise that takes the form of random independent pixel inversions throughout the image. This form of noise is originally described in [41]. When the probability of pixel inversion becomes relatively high and the noise-corrupted images are subsequently compressed using the simple linear lossy scheme described in section M.3, the resulting feature vector exhibits class-conditional pdfs that are very nearly Gaussian with homoscedastic covariance matrices. As a result, both fully-parametric and partially-parametric proper models exist for these compressed noisy characters.

We characterize the independent noise source as an additive one in order to derive a simple expression for the signal-to-noise ratio (SNR) of the un-compressed noise-corrupted images. We then prove by an application of the central limit theorem that compressing the noisy images generates an approximately homoscedastic Gaussian feature vector, the approximation being better as the SNR drops toward -0.8 dB. We find that differential learning generates a more efficient classifier, given any choice of hypothesis class, as long as the un-compressed image SNR remains above approximately 2 dB. When the SNR drops to 1.2 dB the logistic linear hypothesis class becomes a good approximation to the proper parametric model of the compressed noisy feature vector. As a result, classifiers generated probabilistically from this hypothesis class with training sample sizes of $n \approx 600$ are more efficient than their differentially generated counterparts.

8.5.1  Signal-to-Noise Ratio (SNR) Computations

Recall from section 8.1 that the feature vector for the original un-compressed DB1 digits is a 256-pixel (16 x 16) binary vector with elements of +1/-1, where black = -1 and white = +1. In order to simplify our SNR expressions, we view this binary vector $\tilde{X}$ as a simple affine transformation of another binary vector $X$ with elements of {0, 1}, where black = 1 and white = 0:

$$\tilde{X} \in \tilde{\mathcal{X}} = \{-1, +1\}^{256}, \qquad \tilde{X} = -2X + \mathbf{1}, \qquad X = \{x_1, \ldots, x_{256}\}, \quad X \in \{0, 1\}^{256}. \tag{8.7}$$

Note that $\mathbf{1}$ denotes the identity vector (the vector of all ones). We refer to the un-compressed vector $X$ rather than $\tilde{X}$ throughout the remainder of this chapter, trusting that the resulting mathematical simplifications are worth the sleight-of-hand in (8.7).

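Given $X$ described above, we can characterize a noise-corrupted version of it, in which pixels are independently and randomly inverted; the expression models the noise source as an additive, Bernoulli-distributed random vector

$$\boldsymbol{\epsilon} = \{\epsilon_1, \ldots, \epsilon_{256}\} \in \{-1, 0\}^{256} \tag{8.8}$$

with diagonal covariance matrix $\Sigma = p_\epsilon q_\epsilon \mathbf{I}$, where $\mathbf{I}$ denotes the identity matrix, and the Bernoulli probabilities $p_\epsilon$ and $q_\epsilon$ are given by¹⁰

$$p_\epsilon = P(\epsilon_i = -1), \qquad q_\epsilon = 1 - p_\epsilon = P(\epsilon_i = 0). \tag{8.9}$$

Given this expression for the random noise vector, the noise-corrupted version of $X$, which we denote by $\boldsymbol{\nu}$, is given by

$$\boldsymbol{\nu} = |X + \boldsymbol{\epsilon}| \quad \text{s.t.} \quad \nu_i = |x_i + \epsilon_i| \;\; \forall\, i. \tag{8.10}$$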
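The noise model in (8.8) through (8.10) can be sketched in a few lines. This is an illustrative sketch, not the code used in the experiments; the function name `corrupt` and the use of Python's `random` module are assumptions made for the example.

```python
import random


def corrupt(pixels, p_eps, rng=None):
    """Apply independent pixel inversions to a binary image.

    pixels: sequence of 0/1 values (1 = black, 0 = white).
    p_eps:  probability that the noise element equals -1 (pixel inversion).
    rng:    optional random.Random instance, for reproducibility.
    """
    rng = rng or random.Random()
    noisy = []
    for x in pixels:
        # Bernoulli noise element: -1 with probability p_eps, otherwise 0.
        eps = -1 if rng.random() < p_eps else 0
        # Additive model: the noisy pixel is |x + eps|.
        noisy.append(abs(x + eps))
    return noisy
```

When the noise element is 0 the pixel is unchanged; when it is -1 the pixel is inverted, since |0 - 1| = 1 and |1 - 1| = 0.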


In simple terms, all 256 noise vector elements are independent and identically distributed (i.i.d.) Bernoulli random variables: the probability of pixel inversion is $p_\epsilon$, and, by (8.8) and (8.10), when $\epsilon_i = -1$ the noise-corrupted pixel $\nu_i$ is the inverse of its un-corrupted counterpart $x_i$:

$$\nu_i = \begin{cases} x_i, & \epsilon_i = 0 \\ 1, & \epsilon_i = -1 \,\cap\, x_i = 0 \\ 0, & \epsilon_i = -1 \,\cap\, x_i = 1 \end{cases} \tag{8.11}$$

¹⁰ Note: A Bernoulli random variable that assumes the value -1 with probability $p$ and the value 0 with probability $q = 1 - p$ will have a mean value of $-p$ and a variance $pq = p(1 - p)$. See [28, pg. 244] for an example of the first and second moment computations underlying these expressions.

In order to derive a simple expression for the SNR of the noise-corrupted feature vector, we make the simplifying assumption that each element of $X$ is itself a Bernoulli random variable when considered over the set of all 10 digits, and that all the elements of $X$ are i.i.d. This assumption is of course invalid, but we argue that it is permissible for the purpose of computing a simple measure of the noise-corrupted images' SNR. Based on the statistics of the un-corrupted DB1 database, the probability of a black pixel $p_x$ is 0.30, and the probability of a white pixel $q_x$ is 0.70:

$$p_x = \underbrace{P(x_i = 1)}_{\text{black pixel}} = 0.3, \qquad q_x = 1 - p_x = \underbrace{P(x_i = 0)}_{\text{white pixel}} = 0.7. \tag{8.12}$$


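As a result, we can express the SNR of the noise-corrupted DB1 database in terms of the variances of the Bernoulli random variables $x_i$ and $\epsilon_i$ thus:

$$\mathrm{SNR} = 10 \log_{10}\!\left(\frac{\mathrm{Var}[x_i]}{\mathrm{Var}[\epsilon_i]}\right) \text{dB} = 10 \log_{10}\!\left(\frac{p_x q_x}{p_\epsilon q_\epsilon}\right) \text{dB} = 10 \log_{10}\!\left(\frac{0.21}{p_\epsilon q_\epsilon}\right) \text{dB}. \tag{8.13}$$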
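As a small worked computation of (8.13), the sketch below evaluates the SNR for a given pixel-inversion probability. It assumes the black-pixel probability of 0.3 from (8.12); the function name `snr_db` is hypothetical.

```python
import math


def snr_db(p_eps, p_x=0.3):
    """SNR in dB from the ratio of Bernoulli variances, as in (8.13).

    p_eps: probability of pixel inversion (must lie strictly between 0 and 1).
    p_x:   probability of a black pixel (0.3 is the value quoted in the text).
    """
    if not 0.0 < p_eps < 1.0:
        raise ValueError("p_eps must lie strictly between 0 and 1")
    signal_var = p_x * (1.0 - p_x)      # Var[x_i] = p_x * q_x
    noise_var = p_eps * (1.0 - p_eps)   # Var[eps_i] = p_eps * q_eps
    return 10.0 * math.log10(signal_var / noise_var)
```

Under these assumptions, inversion probabilities of 0.1, 0.2, and 0.5 give roughly 3.7 dB, 1.2 dB, and -0.8 dB respectively, which are the three SNR values quoted in this section.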

8.5.2  The Compressed Noisy Feature Vector is Approximately Homoscedastic Gaussian

Using the compression scheme of section M.3, we compress four neighboring binary pixels into one 5-state pixel. In so doing, the (16 x 16) image is compressed to an (8 x 8) image. Consider a single cluster of four noise-corrupted pixels $\{\nu_1, \nu_2, \nu_3, \nu_4\}$ in the original 256-pixel image. This group of four pixels forms a single pixel $\nu'$ in the compressed image by the following transformation:

$$\nu' = \sum_{i=1}^{4} \nu_i = \sum_{i=1}^{4} |x_i + \epsilon_i| \tag{8.14}$$
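The transformation in (8.14) can be sketched as follows. The sketch assumes that the four neighboring pixels form a 2 x 2 block and that the image is stored as a flat, row-major list; both are assumptions made for illustration, since the actual scheme is the one described in section M.3. The function name `compress` is hypothetical.

```python
def compress(image, side=16):
    """Sum each 2 x 2 block of a binary image into one 5-state pixel (0..4).

    image: flat, row-major list of side * side binary pixels.
    side:  width and height of the square image (must be even).
    """
    if side % 2 != 0 or len(image) != side * side:
        raise ValueError("image must be a square of even side length")
    half = side // 2
    compressed = []
    for r in range(half):
        for c in range(half):
            top = (2 * r) * side + 2 * c   # index of the block's top-left pixel
            bottom = top + side            # index of the pixel directly below it
            compressed.append(
                image[top] + image[top + 1] + image[bottom] + image[bottom + 1]
            )
    return compressed
```

A 256-pixel image yields 64 compressed pixels, each taking one of the five values 0 through 4.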

Sixty-four of these compressed pixels form the compressed noise-corrupted feature vector $\boldsymbol{\nu}'$. We omit a subscript when discussing individual pixels (i.e., individual elements of $\boldsymbol{\nu}'$), opting instead for the generic pixel notation $\nu'$ in (8.14). We do this in the interest of notational simplicity, assuming that the reader understands the relationship between the vector $\boldsymbol{\nu}'$ and one of its 64 constituent pixels $\nu'$.

If we consider that the value of a pixel (black or white) in the original un-corrupted 256-pixel binary image depends on the digit (i.e., the class that the image represents), we can express (8.14) in its class-conditional form:

$$\nu' \,|\, \omega_j = \sum_{i=1}^{4} \nu_i \,|\, \omega_j = \sum_{i=1}^{4} \big|\, x_i \,|\, \omega_j + \epsilon_i \,\big| \tag{8.15}$$

In fact it is legitimate to consider the $i$th pixel $x_i$ in the original 256-pixel un-corrupted binary image as a Bernoulli-distributed random variable with a class-conditional parameter $p_{x_i|\omega_j}$. That is, for a given digit $\omega_j$, $x_i$ will be black with probability $p_{x_i|\omega_j}$ and white with probability $q_{x_i|\omega_j} = 1 - p_{x_i|\omega_j}$, such that the pdf of $x_i$, given $\omega_j$, is¹¹

$$p_{x|\omega}(x_i \,|\, \omega_j) = q_{x_i|\omega_j}\, \delta(x) + p_{x_i|\omega_j}\, \delta(x - 1), \tag{8.16}$$

where $\delta$ denotes the Dirac delta function (e.g., [80, pg. 266]). From (8.8) and (8.9) the pdf of the noise vector element $\epsilon_i$ is

$$p_{\epsilon}(\epsilon_i) = p_\epsilon\, \delta(\epsilon + 1) + q_\epsilon\, \delta(\epsilon) \tag{8.17}$$

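Since $x_i$ and $\epsilon_i$ are independent, the pdf of $x_i + \epsilon_i$ is the convolution of their individual pdfs (e.g., see [25, pg. 36]). As a result, the pdf of $\nu_i \,|\, \omega_j$ is, by (8.10), (8.16), and (8.17),

$$p_{\nu|\omega}(\nu_i \,|\, \omega_j) = \underbrace{\left[ p_\epsilon\, p_{x_i|\omega_j} + q_\epsilon\, q_{x_i|\omega_j} \right]}_{q_{\nu_i|\omega_j}} \delta(\nu) + \underbrace{\left[ p_\epsilon\, q_{x_i|\omega_j} + p_{x_i|\omega_j}\, q_\epsilon \right]}_{p_{\nu_i|\omega_j}} \delta(\nu - 1) \tag{8.18}$$

Equation (8.18) shows that $\nu_i$ is itself a Bernoulli-distributed random variable with parameter $p_{\nu_i|\omega_j}$.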

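Although all the $\epsilon_i$ are independent, the $x_i$ are not. The hand-strokes that generated the original DB1 images induced a fair amount of spatial correlation in the pixels. We make the incorrect but simplifying assumption that the pixels are indeed independent so that we can derive a tractable approximation to the pdf of the compressed, class-conditional noise-corrupted pixel $\nu' \,|\, \omega_j$. By assuming statistical independence among the original class-conditional constituent pixels $x_1|\omega_j, \ldots, x_4|\omega_j$, we can express the class-conditional pdf $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ as the convolution of $p_{\nu|\omega}(\nu_1|\omega_j), \ldots, p_{\nu|\omega}(\nu_4|\omega_j)$, where $*$ denotes the convolution operator:

$$p_{\nu'|\omega}(\nu' \,|\, \omega_j) = p_{\nu|\omega}(\nu_1 \,|\, \omega_j) * \cdots * p_{\nu|\omega}(\nu_4 \,|\, \omega_j) \tag{8.19}$$

$$\begin{aligned}
= {} & \left[ \prod_{i=1}^{4} q_{\nu_i|\omega_j} \right] \delta(\nu') + \left[ \sum_{i=1}^{4} p_{\nu_i|\omega_j} \prod_{k \neq i} q_{\nu_k|\omega_j} \right] \delta(\nu' - 1) \\
& + \left[ \begin{aligned}
& p_{\nu_1|\omega_j} p_{\nu_2|\omega_j} q_{\nu_3|\omega_j} q_{\nu_4|\omega_j} + p_{\nu_1|\omega_j} p_{\nu_3|\omega_j} q_{\nu_2|\omega_j} q_{\nu_4|\omega_j} \\
& + p_{\nu_1|\omega_j} p_{\nu_4|\omega_j} q_{\nu_2|\omega_j} q_{\nu_3|\omega_j} + p_{\nu_2|\omega_j} p_{\nu_3|\omega_j} q_{\nu_1|\omega_j} q_{\nu_4|\omega_j} \\
& + p_{\nu_2|\omega_j} p_{\nu_4|\omega_j} q_{\nu_1|\omega_j} q_{\nu_3|\omega_j} + p_{\nu_3|\omega_j} p_{\nu_4|\omega_j} q_{\nu_1|\omega_j} q_{\nu_2|\omega_j}
\end{aligned} \right] \delta(\nu' - 2) \\
& + \left[ \sum_{i=1}^{4} q_{\nu_i|\omega_j} \prod_{k \neq i} p_{\nu_k|\omega_j} \right] \delta(\nu' - 3) + \left[ \prod_{i=1}^{4} p_{\nu_i|\omega_j} \right] \delta(\nu' - 4)
\end{aligned} \tag{8.20}$$

¹¹ Since $x_i$ is countable, it has a probability mass function (pmf) rather than a probability density function (pdf). We consider the pmf a special case of the pdf in which the pdf is expressed as a sum of Dirac delta functions, thus the expression in (8.16).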


Equation (8.20) is, in fact, a kind of binomial pdf (or pmf, strictly speaking) for the noisy compressed class-conditional pixel $\nu' \,|\, \omega_j$. The expression reduces to the familiar binomial form ($n = 4$, $p = p_{\nu_i|\omega_j}$) if all the $p_{x_i|\omega_j}$ in (8.16) (and, as a result, all the $p_{\nu_i|\omega_j}$ in (8.20)) are equal. The central limit theorem assures us that the sum of a large number of independent random variables will have a pdf that is approximately Gaussian. Indeed, the DeMoivre-Laplace approximation can be viewed as a special expression of the central limit theorem, by which binomial-distributed random variables are shown to be very nearly Gaussian. The approximation is fairly good when the number of Bernoulli trials giving rise to the binomial distribution is $O[10]$ or greater and the binomial parameter $p \approx 0.5$ (e.g., [63, pg. 186]). Note that since our compressed image pixel is the sum of four binary image pixels (i.e., the sum of a total of four Bernoulli trials), the pdf $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ is an increasingly good approximation to a Gaussian random variable as the noise probability $p_\epsilon$ approaches 0.5 (this corresponds to a signal-to-noise ratio of SNR → -0.8 dB). Again, the goodness of the Gaussian approximation is diminished somewhat by the invalid assumption that the original image pixels are independent; nevertheless, the approximation proves to be reasonably good as the image SNR drops below 2 dB (i.e., when $p_\epsilon > 0.13$).

Figure 8.22 illustrates the pdf $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ (black arrows denote the Dirac delta functions of the expression in (8.20)) when the probability of pixel inversion is $p_\epsilon = 0.2$ and $p_{x_i|\omega_j}$ is assumed to be 0.3 for all $i$. Under these conditions the OCR image signal to noise ratio is SNR = 1.2 dB, and $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ is indeed a good approximation to the Gaussian pdf shown in light gray.
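The comparison described for figure 8.22 can be reproduced numerically. The sketch below builds the pmf of the compressed pixel by repeated convolution, as in (8.19), and evaluates a Gaussian pdf with the same mean and variance at the five support points. It relies on the same independence assumption made in the text; all function names are hypothetical, and matching the Gaussian by mean and variance is an assumption about how the comparison is made.

```python
import math


def noisy_pixel_prob(p_x, p_eps):
    """P(nu_i = 1 | class), the Bernoulli parameter given by (8.18)."""
    return p_eps * (1.0 - p_x) + p_x * (1.0 - p_eps)


def convolve(a, b):
    """Discrete convolution of two pmfs given as lists."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def compressed_pixel_pmf(p_xs, p_eps):
    """pmf of the sum of independent noisy pixels, one entry per value 0..len(p_xs)."""
    pmf = [1.0]
    for p_x in p_xs:
        p_nu = noisy_pixel_prob(p_x, p_eps)
        pmf = convolve(pmf, [1.0 - p_nu, p_nu])
    return pmf


def gaussian_pdf(v, mean, var):
    """Gaussian density with the given mean and variance, evaluated at v."""
    return math.exp(-((v - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


# Conditions of figure 8.22: p_eps = 0.2 and p_x = 0.3 for all four pixels.
pmf = compressed_pixel_pmf([0.3] * 4, 0.2)
mean = sum(k * p for k, p in enumerate(pmf))
var = sum((k - mean) ** 2 * p for k, p in enumerate(pmf))
for k, p in enumerate(pmf):
    print(k, round(p, 4), round(gaussian_pdf(k, mean, var), 4))
```

With equal pixel probabilities the pmf reduces to the binomial form with n = 4 and p = 0.38, so its mean is 4p and its variance is 4p(1 - p).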

As the SNR drops below 2 dB, the compressed image pixels are increasingly independent of one another, since the noise statistics begin to dominate the image. Under these circumstances, the noise-corrupted class-conditional feature vectors, which we denote by $\boldsymbol{\nu}' \,|\, \omega_j$ ($j = 1, \ldots, 10$), become homoscedastic Gaussian-distributed. They satisfy the conditions of appendix F such that the logistic linear hypothesis class constitutes the partially-parametric proper model of $\boldsymbol{\nu}'$. This phenomenon is demonstrated in the experiments that follow.

Figure 8.22: The probability density function $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ (shown with black arrows denoting the Dirac delta functions of the pdf) for the noisy compressed DB1 image pixel $\nu' \,|\, \omega_j$ when the probability of pixel inversion is $p_\epsilon = 0.2$, so that the OCR image signal to noise ratio is SNR = 1.2 dB. For the purpose of this particular graphic, it is assumed that each noise-free pixel in the four being compressed ($\{x_1, \ldots, x_4\}$) has a 30% probability of being black (i.e., $p_{x_i|\omega_j}$ is assumed to be 0.3 for $i = 1, \ldots, 4$). The gray-shaded background is the Gaussian pdf that $p_{\nu'|\omega}(\nu' \,|\, \omega_j)$ approximates.

8.5.3  Recognition  Results  for  a  Moderate  SNR 

Figure 8.23 shows moderately noise-corrupted versions of the compressed DB1 digits shown in figure 8.5. The images are arranged in a random order so that the reader has no order-based cues for recognizing the digits. The compressed images are derived from the original 256-pixel binary images after the latter have been corrupted by a noise source with $p_\epsilon = 0.1$. That is, the binary pixels of the original images are inverted (or "flipped") with probability 0.1, a moderate amount of noise. By (8.13), the noisy image SNR is therefore 3.7 dB. The noise-corrupted 256-pixel images, from which those in figure 8.23 are formed by 64 compression operations of the form in (8.14), are shown in figure 8.33 (page 269). The sequence of digits in figure 8.33 is different from that in figure 8.23. Both sequences are given on page 270, although we ask the reader to indulge us by not peeking at the answers for a while.

Figure 8.23: Moderately noisy versions of the digits shown in figure 8.5. The order of the images has been randomized so that the digits they represent cannot be inferred from their position on the page. The correct labels for these examples are shown in figure 8.35. We encourage the reader to classify these images prior to looking at the correct labels. Figure 8.33 shows the original 256-pixel noisy images from which these linearly compressed images were derived (the image sequence is different).

Figure 8.24: Parameters of the differential logistic linear classifier generated from the first of 25 moderately noisy DB1 database training samples. Each of the 50 samples (25 training and 25 test, each containing approximately 600 randomly selected examples) is corrupted with a different realization of the moderate-intensity noise source. The entropy of these parameters is 3.46, reflecting the increased entropy of the noise-corrupted images versus the original images (cf. figure 8.6).

Figure 8.24 shows the parameters of the logistic linear classifier generated by differential learning from the first of the 25 moderately noise-corrupted training samples. The parametric entropy of all the classifier's parameters is 3.46, versus 3.18 for the parameters of the logistic linear classifier differentially generated from the noise-free benchmark training sample (cf. figures 8.24 and 8.6, page 228). The increased parametric entropy reflects the increased information content of the training sample; the information increase stems from the noise added to the images.

Figure 8.25 summarizes the empirical test sample error rates for classifiers generated from the logistic linear hypothesis class by differential and probabilistic learning. The statistics are obtained from the same 25 independent randomly-generated partitions of the DB1 database used in the earlier noise-free experiments. The noise corruption for each of the 25 different training/test samples is independent of that for any and all other trials. Figure 8.25 shows that the differentially-generated logistic linear classifier is somewhat more efficient than its probabilistically-generated counterparts for this nominal training sample size and this moderate amount of noise corruption (SNR = 3.7 dB). The average empirical test sample error rate is 6.6% for the differentially-generated classifier, versus 7.0% (MSE) and 7.5% (CE) for the probabilistically-generated classifiers. All classifiers exhibit about the same empirical discriminant variance ($\approx 1.2 \times 10^{-4}$). The right-hand side of figure 8.25 shows that, although the differentially-generated classifier's empirical MSDE is lower than the probabilistically-generated classifiers', differential learning does not consistently generate the classifier with the lowest empirical test sample error rate. Probabilistic learning via MSE generates a classifier with a lower empirical test sample error rate for about one third of all trials, and probabilistic learning via CE generates a classifier with a lower empirical test sample error rate for about one quarter of all trials. The SNR of 3.7 dB resulting from the noise-corruption has altered the class-conditional pdfs of the digits in such a way that they have become better approximations to homoscedastic Gaussian pdfs. As a result, the probabilistically-generated logistic linear classifier can learn the noisy compressed images more efficiently relative to its differentially-generated counterpart than it can when the images are noise-free. Specifically, consider the estimated relative efficiency of one learning strategy versus another, given a specific hypothesis class and a specific set of training/test samples:

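Definition 8.6 The estimated relative efficiency of one learning strategy versus another: The estimated relative efficiency (ERE or $\mathrm{RE}[\Lambda, \Lambda' \,|\, \{n_1, \ldots, n_K\}, G(\Theta)]$) of one learning strategy $\Lambda$ versus another $\Lambda'$, given a specific hypothesis class $G(\Theta)$ and a specific set of $K$ training/test samples of sizes $\{n_1, \ldots, n_K\}$ and $\{\eta_1, \ldots, \eta_K\}$ respectively, is simply the ratio of their estimated MSDEs (definition 8.5):

$$\mathrm{RE}[\Lambda, \Lambda' \,|\, \{n_1, \ldots, n_K\}, G(\Theta)] \triangleq \frac{\widehat{\mathrm{MSDE}}[\mathcal{G} \,|\, \{n_1, \ldots, n_K\}, G(\Theta), \Lambda]}{\widehat{\mathrm{MSDE}}[\mathcal{G} \,|\, \{n_1, \ldots, n_K\}, G(\Theta), \Lambda']} \tag{8.21}$$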


For $K = 25$ learning and testing trials (sample sizes are $n_i \approx \eta_i \approx 600$, $i = 1, \ldots, 25$) with the noise-free compressed digits and the logistic linear hypothesis class $G(\Theta)$, the ERE of differential learning ($\Lambda_\Delta$) versus probabilistic learning via CE ($\Lambda_{P\text{-CE}}$) is $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-CE}} \,|\, \{n_1, \ldots, n_K\}, G(\Theta)] = 0.26$.¹² The ERE of differential learning versus probabilistic learning via MSE ($\Lambda_{P\text{-MSE}}$) is $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-MSE}} \,|\, \{n_1, \ldots, n_K\}, G(\Theta)] = 0.38$. These ERE figures are shown in table 8.5, and they indicate that the differentially-generated logistic linear classifier is three to four times as efficient as its probabilistically-generated counterparts for a nominal, noise-free training sample size of $n \approx 600$ compressed digits.


Estimating the Bayes error rate for the noisy, compressed images: There is strong evidence that the Bayes error rate for the noise-free, compressed images is indeed zero. However, we are merely guessing the Bayes error rate for the noisy, compressed images; it is important that the reader understand this, because our guess affects the EREs that we quote below. We assume that $P_e(\mathcal{F}_{\text{Bayes}}) = 5\%$ for the moderately noisy, compressed images; we assume that $P_e(\mathcal{F}_{\text{Bayes}}) = 12\%$ for the very noisy, compressed images described in section 8.5.4. These estimates correspond to the lowest empirical test sample error rate exhibited by any classifier in any of the 25 learning/testing trials run at each SNR. Clearly, if the estimates are low, then our MSDE estimates based on them will be high; if the estimates are high, then our MSDE estimates based on them will be low. Bias in our MSDE estimates will, of course, introduce bias in our ERE estimates. In simple terms, if we have over-estimated the Bayes error rate, our ERE values will be biased away from a value of unity; that is, all MSDE estimates will be lower than their actual values, a phenomenon that exaggerates the difference between learning strategy efficiencies. If we have under-estimated the Bayes error rate, our ERE values will be biased towards a value of unity; that is, all MSDE estimates will be higher than their actual values, a phenomenon that obscures any differences between learning strategy efficiencies. We believe that our Bayes error rate estimates are reasonable; nevertheless, we urge the reader to interpret the resulting EREs conservatively.


If we assume that the estimated Bayes error rate is 5% for the moderately noisy compressed images (we stress that this number is a guess), the resulting EREs are $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-CE}} \,|\, \{n_1, \ldots, n_K\}, G(\Theta)] = 0.48$ and $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-MSE}} \,|\, \{n_1, \ldots, n_K\}, G(\Theta)] = 0.75$ (see table 8.5). These ERE figures indicate that the differentially-generated logistic linear classifier is between 1.3 and 2 times as efficient as its probabilistically-generated counterparts for a nominal training sample size of 600 moderately noisy, compressed digits. Note that the ERE of differential learning, given the logistic linear hypothesis class, has increased with decreasing SNR, evidence that the compressed noisy images are more Gaussian-like (and, as a result, more properly modeled by the logistic linear hypothesis class) than the noise-free compressed images were.
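The way the assumed Bayes error rate enters an ERE figure can be illustrated with a rough computation. The sketch below assumes that the estimated MSDE can be written as the squared difference between the average empirical error rate and the Bayes error rate, plus the empirical discriminant variance; definition 8.5 is not reproduced in this section, so this decomposition is an assumption, as are the function names and the variance value of 1.2e-4. It uses only the summary statistics quoted above for the moderately noisy images, not the per-trial data.

```python
def msde_estimate(mean_error, error_variance, bayes_error):
    """Assumed form of the estimated MSDE: squared bias plus variance.

    mean_error:     average empirical test sample error rate (as a fraction).
    error_variance: empirical discriminant variance.
    bayes_error:    assumed Bayes error rate (as a fraction).
    """
    bias = mean_error - bayes_error
    return bias ** 2 + error_variance


def estimated_relative_efficiency(msde_a, msde_b):
    """ERE of strategy A versus strategy B: the ratio of their estimated MSDEs."""
    return msde_a / msde_b


# Summary statistics quoted for the moderately noisy images (SNR = 3.7 dB).
bayes = 0.05        # assumed Bayes error rate (the text stresses this is a guess)
variance = 1.2e-4   # assumed common empirical discriminant variance

msde_cfm = msde_estimate(0.066, variance, bayes)  # differential learning
msde_mse = msde_estimate(0.070, variance, bayes)  # probabilistic learning via MSE
msde_ce = msde_estimate(0.075, variance, bayes)   # probabilistic learning via CE

ere_vs_mse = estimated_relative_efficiency(msde_cfm, msde_mse)
ere_vs_ce = estimated_relative_efficiency(msde_cfm, msde_ce)
print(round(ere_vs_mse, 2), round(ere_vs_ce, 2))
```

Under these assumptions the ratios come out near 0.72 and 0.50, close to but not identical with the reported values of 0.75 and 0.48, which are computed from the individual trials. Raising or lowering `bayes` in the sketch shows the sensitivity discussed above: a larger assumed Bayes error rate pushes the ratios away from unity.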


¹² We remind the reader that the estimated Bayes error rate $P_e(\mathcal{F}_{\text{Bayes}})$ for the noise-free compressed images is

Figure 8.25: Left: Test sample classification summaries for the 650-parameter logistic linear classifier employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). The summaries are based on 25 independent trials in which the DB1 database is randomly partitioned into moderately noisy training and test samples, each containing approximately 600 examples. The box plots are a non-parametric depiction of the empirical test sample error rate's distribution over the 25 trials; the whisker plots depict the average empirical test sample error rate plus and minus one standard deviation, thereby characterizing each classifier's MSDE. Right: The increase in the discriminant error of the two probabilistically-generated models over the differentially-generated model on a trial-by-trial basis. These box plots show that differential learning does not always generate the classifier with the lowest empirical error rate. This is due to the low (3.7 dB) signal-to-noise ratio (SNR) of the examples and the compression scheme we employ (see figure 8.23): the post-compression pdf of the noisy feature vector becomes increasingly Gaussian-like as the SNR drops, so the probabilistically-generated logistic linear classifier becomes an increasingly good approximation to the proper parametric model of the noisy, compressed digits.

Figure 8.26: Left: The test sample classification summaries shown in figure 8.25 for the 650-parameter (64-pixels/digit) logistic linear classifier employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). Right: Classification summaries for fifteen human subjects asked to classify the 40 64-pixel examples shown in figure 8.23. Far Right: Classification summaries for fifteen different human subjects asked to classify the 40 256-pixel examples shown in figure 8.33, from which the compressed versions in figure 8.23 are derived.

Estimated Relative Efficiency (ERE) of Differential Learning, $n \approx 600$

Hypothesis Class     SNR            Alternative Learning Strategy
                                    $\Lambda_P$ (MSE)     $\Lambda_P$ (CE)
Linear               ∞ dB (a)       0.12                  N/A
                     3.7 dB (b)     0.11                  N/A
                     1.2 dB (c)     0.40                  N/A
Logistic Linear      ∞ dB (a)       0.38                  0.26
                     3.7 dB (b)     0.75                  0.48
                     1.2 dB (c)     2.01                  2.28
Modified RBF         ∞ dB (a)       0.12                  0.16
                     3.7 dB         -                     -
                     1.2 dB         -                     -

Table 8.5: The estimated relative efficiency (ERE; see definition 8.6) of differential versus probabilistic learning for the linear, logistic linear, and modified Gaussian RBF hypothesis classes (the nominal training sample size is $n \approx 600$ compressed digits). If ERE < 1, differential learning generates a more efficient classifier from the hypothesis class than the alternative learning strategy does, given the SNR. If ERE > 1, differential learning generates a less efficient classifier from the hypothesis class than the alternative learning strategy does, given the SNR. ERE estimates are based on 25 independent learning/testing trials in which the DB1 database is randomly partitioned into training and test samples, each containing approximately 600 examples. Estimates are shown for all hypothesis classes, given the noise-free digits (SNR = ∞). Estimates are shown only for the linear and logistic linear hypothesis classes, given the noisy digits (SNR = 3.7 dB and 1.2 dB). Note that the only case in which differential learning does not generate the most efficient classifier for a nominal training sample size of 600 is when the SNR = 1.2 dB and the logistic linear hypothesis class is employed. Under these conditions, the logistic linear hypothesis class is a good approximation to the proper parametric model of the very noisy, compressed digits, so the CE-generated logistic linear classifier is a good approximation to the efficient classifier.

(a) Estimated Bayes error rate (i.e., $P_e(\mathcal{F}_{\text{Bayes}})$) is assumed to be 0% for the noise-free compressed images.

(b) Estimated Bayes error rate is assumed to be 5% for the moderately noisy compressed images.

(c) Estimated Bayes error rate is assumed to be 12% for the very noisy compressed images.


Human  Recognition  of  the  Moderately  Noisy  Images 

Figure 8.26 replicates the box plots in the left-hand side of figure 8.25 alongside box plots that summarize the empirical error rates of fifteen human subjects. The human subjects were asked to classify the forty images of figure 8.23, and their empirical error rates were computed according to (8.1) using $\eta = 40$. The human experiment was not rigorously matched against the machine experiments. The humans were not given any training examples; instead they relied solely on their prior knowledge of digit forms to perform the classification task. Also, the humans made their classifications by viewing a laser-printed version of the 40 examples in figure 8.23, whereas the machine learned and subsequently recognized numeric feature vectors. Presumably the printed character images do not contain the same information as their numeric representations, owing to non-linearities in the production and perception of the gray-scale representations of the compressed feature vector. In short, the experiment was almost surely biased against the human subjects. These handicaps notwithstanding, all subjects were electrical and computer engineering graduate students, who, tending to be fundamentally insecure over-achievers, were motivated to perform well on the task.

Fifteen subjects classified the 64-pixel images in figure 8.23, and fifteen different subjects classified
the 256-pixel “parent” images in figure 8.33. Figure 8.26 (right) shows that the median empirical test
sample error rate for humans was 15% for the compressed images, approximately twice the
differentially-generated logistic linear classifier's median rate. Moreover, the human subjects’ discriminant variance was
more than an order of magnitude higher than the logistic linear classifiers’, as indicated by the comparative
spans of the box plots for the compressed image experiments. At this point, we encourage the reader to
classify the images in figures 8.23 and 8.33; then determine your empirical error rates by comparing your
classifications with the answers in figures 8.35 and 8.37.
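
As a rough illustration of how a reader might score this exercise, the sketch below computes an empirical error rate as the fraction of misclassified examples. That form is our assumption about (8.1), which is defined earlier in the chapter and not reproduced here. The function name and the sample labels are hypothetical.

```python
from statistics import median


def empirical_error_rate(predicted, correct):
    """Fraction of examples whose predicted label differs from the correct one."""
    if len(predicted) != len(correct) or not correct:
        raise ValueError("label sequences must be non-empty and of equal length")
    errors = sum(p != c for p, c in zip(predicted, correct))
    return errors / len(correct)


# Hypothetical subject who misreads 6 of the 40 digits.
correct = [i % 10 for i in range(40)]
predicted = list(correct)
for i in range(6):
    predicted[i] = (correct[i] + 1) % 10  # force a wrong label

rate = empirical_error_rate(predicted, correct)

# Median over a hypothetical group of three subjects' error rates.
group_median = median([0.10, 0.15, 0.20])
```

With 40 test images, each misclassification changes the rate by 2.5 percentage points, so a 15% rate corresponds to six errors and a 2.5% rate to a single error.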

Note that the human subjects who classified the un-compressed noisy images had a median empirical error
rate of 2.5% (figure 8.26, far right), much lower than the 15% median human rate for the compressed images;
we discuss this phenomenon further in section 8.5.4.

Learning and Recognizing the Moderately Noisy Images with the Linear Hypothesis Class

Figure 8.27 summarizes the empirical test sample error rates for classifiers generated from the linear
hypothesis class by differential and probabilistic learning. The statistics are obtained from the same 25
independent randomly-generated partitions of the DB1 database used in the logistic linear experiments.
Figure 8.27 shows that the differentially-generated linear classifier is more efficient than its
probabilistically-generated counterpart for this moderate amount of noise corruption (SNR = 3.7 dB). The average empirical
test sample error rate is 6.1% for the differentially-generated classifier, versus 9.7% for the
probabilistically-generated (MSE) classifier. Both classifiers exhibit about the same empirical discriminant variance
(approximately $1.3 \times 10^{-4}$). The right-hand side of figure 8.27 shows that differential learning consistently
generates the classifier with the lowest empirical test sample error rate. Probabilistic learning via MSE
generates a classifier with an empirical test sample error rate that is typically about 1.3 times that of its
differentially-generated counterpart. The MSE-generated classifier’s empirical MSDE is about 10 times the
differentially-generated classifier’s. This is reflected in the differential learning strategy’s ERE, which is

$\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-MSE}} \mid \{\omega_1, \ldots, \omega_C\}, G(\Theta)] = 0.11$ for the linear hypothesis class and the 3.7 dB SNR.

¹¹ We were surprised by the number of subjects who took the experiment as a real challenge and wanted to know how well they had
done relative to the whole subject population. When told that the machine did far better than they did, these subjects seemed uniformly
relieved to know that good performance on the task didn’t require "real" intelligence.

Figure 8.27: Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning ($\Lambda_\Delta$) and the MSE form of probabilistic learning ($\Lambda_P$). The summaries are based on 25
independent trials in which the DB1 database is randomly partitioned into moderately noisy training and test
samples, each containing approximately 600 examples. The box plots are a non-parametric depiction of the
empirical test sample error rate's distribution over the 25 trials; the whisker plots depict the average empirical
test sample error rate plus and minus one standard deviation, thereby characterizing each classifier's MSDE.
Right: The increase in the discriminant error of the probabilistic model over the differentially-generated
model on a trial-by-trial basis. Because the probabilistically-generated linear classifier remains an improper
parametric model of the compressed digits with decreasing SNR, differential learning always generates the
classifier with the lowest empirical error rate/MSDE.
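
The summaries described in this caption can be sketched in a few lines. The example below is only illustrative: the per-trial error rates are hypothetical (five trials rather than 25), the function names are ours, and the use of the sample standard deviation for the whiskers is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, median, quantiles, stdev


def summarize(rates):
    """Box-plot quartiles plus the mean +/- one standard deviation 'whiskers'."""
    q1, _, q3 = quantiles(rates, n=4)
    m, s = mean(rates), stdev(rates)
    return {"median": median(rates), "q1": q1, "q3": q3,
            "mean": m, "whiskers": (m - s, m + s)}


def trial_increase(prob_rates, diff_rates):
    """Per-trial increase in error rate of the probabilistic model over the
    differential model (the quantity plotted in the right-hand panel)."""
    return [p - d for p, d in zip(prob_rates, diff_rates)]


# Hypothetical per-trial test error rates, NOT the values behind figure 8.27.
diff_rates = [0.05, 0.06, 0.06, 0.07, 0.06]
mse_rates = [0.09, 0.10, 0.09, 0.11, 0.10]

diff_summary = summarize(diff_rates)
increases = trial_increase(mse_rates, diff_rates)
```

A right-hand panel whose box lies entirely above zero corresponds to every entry of `increases` being positive, which is the sense in which one strategy "always" yields the lower error rate.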


These results stand in contrast to those for the logistic linear classifier (cf. figures 8.25 and 8.27). The
probabilistically-generated linear classifier is consistently worse than its differentially-generated counterpart
because the linear hypothesis class remains an improper parametric model of the noisy DB1 feature vector
for all SNR values. As a result, differential learning’s asymptotic efficiency, which holds for any and
all hypothesis classes, generates the relatively efficient linear classifier for these training sample sizes of
n ≈ 600. Experiments were not conducted with the modified Gaussian RBF hypothesis class because it is
clearly an improper parametric model of both the noise-free and noisy digits. For this reason, differential
learning would surely generate the most efficient RBF classifier, regardless of the SNR, just as it does for the
linear hypothesis class.

Finally, note that the test sample error rates, empirical MSDE, etc. for the differentially-generated logistic
linear and linear classifiers are virtually identical (cf. figures 8.25 and 8.27). This phenomenon follows the
trends we see in the noise-free experiments, and reflects the asymptotic efficiency of differential learning.
It generates classifiers with the same empirical MSDE from these two hypothesis classes because both
hypothesis classes are capable of forming the same piecewise linear boundaries on feature vector space,
despite their functional differences.

8.5.4 Recognition Results for a Low SNR

Figure 8.28 shows highly noise-corrupted versions of the compressed DB1 digits shown in figure 8.5. Like
their moderately noisy counterparts, these images are arranged in a random order so that the reader has no
order-based cues for recognizing the digits. The compressed images are derived from the original 256-pixel
binary images after the latter have been corrupted by a noise source that inverts (or “flips”) each binary
pixel with probability 0.2, a high amount of noise. By (8.13), the noisy image SNR is therefore 1.2 dB.
The noise-corrupted 256-pixel images, from which those in figure 8.28 are formed by 64 compression
operations of the form in (8.14), are shown in figure 8.34
(page 269). The sequence of digits in figure 8.34 is different from that in figure 8.28. Both sequences are
given on page 270; again, we ask the reader to indulge us by not peeking at the answers for a while.
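
The pixel-inversion noise described above is simple to sketch. The example below assumes a binary image stored as a flat list of 0/1 values and flips each pixel independently; the function name, the seed, and the test image are hypothetical, and neither the compression step of (8.14) nor the SNR formula of (8.13) is reproduced here.

```python
import random


def flip_pixels(image, flip_prob, rng):
    """Invert each binary pixel (0 <-> 1) independently with probability flip_prob."""
    return [1 - px if rng.random() < flip_prob else px for px in image]


rng = random.Random(0)   # fixed seed so the sketch is repeatable
clean = [0, 1] * 128     # hypothetical 256-pixel binary image
noisy = flip_pixels(clean, 0.2, rng)

# Number of pixels actually inverted in this realization (about 20% on average).
flipped = sum(a != b for a, b in zip(clean, noisy))
```

Passing a separately seeded generator for each training and test sample would give every sample its own noise realization, in keeping with the statement that the corruption is independent across trials.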

Figure 8.29 shows the parameters of the logistic linear classifier generated by differential learning from
the first of the 25 highly noise-corrupted training samples. The parametric entropy of all the classifier’s
parameters is 3.85, versus 3.46 for the parameters of the logistic linear classifier differentially generated from
the analogous moderately noisy training sample, and 3.18 for the parameters of the logistic linear classifier
differentially generated from the noise-free benchmark training sample (cf. figures 8.29, 8.24, page 254,
and 8.6, page 228). The increased parametric entropy reflects the increased information content of the
training sample stemming from the higher level of noise in the images.

Figure 8.30 summarizes the empirical
test sample error rates for classifiers generated from the logistic linear hypothesis class by differential
and probabilistic learning. The statistics are obtained from the same 25 independent randomly-generated
partitions of the DB1 database used in the earlier noise-free and moderate noise experiments, the only
difference being the increased level of noise corruption. The noise corruption for each of the 25 different
training/test samples is independent of that for any and all other trials. Figure 8.30 and table 8.5 show
that the differentially-generated logistic linear classifier is less efficient than its probabilistically-generated
counterparts for this nominal training sample size and this high amount of noise corruption (SNR = 1.2 dB).
The average empirical test sample error rate is 16.0% for the differentially-generated classifier, versus 14.7%
(MSE) and 14.4% (CE) for the probabilistically-generated classifiers. The differentially-generated classifier
exhibits an empirical discriminant variance of $1.7 \times 10^{-4}$, and both probabilistically-generated classifiers
exhibit an empirical discriminant variance of approximately $1.8 \times 10^{-4}$. The right-hand side of figure 8.30
shows that differential learning never generates the classifier with the lowest empirical test sample error
rate; instead, probabilistic learning via the Kullback-Leibler information distance (CE) does. This is because
the SNR of 1.2 dB resulting from noise corruption with a pixel-inversion probability of 0.2 has altered the class-conditional
pdfs of the digits in such a way that they have become reasonably good approximations to homoscedastic
Gaussian pdfs. As a result, the CE-generated logistic linear classifier is a good approximation to the proper
parametric model of the very noisy compressed feature vector: it can learn the very noisy compressed images
more efficiently than its differentially-generated counterpart can (recall section 3.6). Indeed, although the
MSE-generated logistic linear classifier is not, strictly speaking, the proper parametric model of the noisy
compressed feature vector, it is also a good approximation thereto. As a result, the MSE-generated logistic
linear classifier consistently exhibits a lower empirical test sample error rate than the differentially-generated
classifier’s. Table 8.5 confirms these facts by showing that the differential learning strategy’s EREs are

$\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-MSE}} \mid \{\omega_1, \ldots, \omega_C\}, G(\Theta)] = 2.01$ and $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-CE}} \mid \{\omega_1, \ldots, \omega_C\}, G(\Theta)] = 2.28$

for the logistic linear hypothesis class and the 1.2 dB SNR.

Figure 8.28: Very noisy versions of the digits shown in Figure 8.5. The order of the images has been
randomized (the sequence differs from those in Figures 8.28 and 8.34) so that the digits they represent cannot
be inferred from their position on the page. The correct labels for these examples are shown in Figure 8.36.
We encourage the reader to classify these images prior to looking at the correct labels. Figure 8.34 shows
the original 256-pixel noisy images from which these linearly compressed images were derived (the image
sequence is different).

Figure 8.29: Parameters of the differential logistic linear classifier generated from the first of 25 very
noisy DB1 database training samples. Each of the 50 samples (25 training and 25 test, each containing
approximately 600 randomly selected examples) is corrupted with a different realization of the high-intensity
noise source. The entropy of these parameters is 3.85, reflecting the increased entropy of the noise-corrupted
images versus those with less or no noise (cf. Figures 8.24 and 8.6).

As  we  have  stated  before,  the  statistical  literature  details  an  extensive  collection  of  hypothesis  testing 
procedures  (e.g.,  see  [140])  by  which  it  is  possible  to  determine  whether  or  not  a  chosen  hypothesis  class 
is  a  good  approximation  to  the  proper  parametric  model  of  the  feature  vector.  If  the  improper  hypothesis  is 
rejected  (i.e.,  if  the  parametric  model  is  determined  to  be  proper),  then  we  are  justified  in  using  the  appropriate 
form  of  probabilistic  learning  and  we  are  justified  in  expecting  that  the  resulting  classifier  will  be  a  good 
approximation  to  the  efficient  classifier  for  small  training  sample  sizes.  If,  on  the  other  hand,  the  improper 
hypothesis  is  not  rejected  (i.e.,  if  the  parametric  model  is  determined  to  be  improper),  then  we  are  better  off 
using  differential  learning,  which  is  guaranteed  to  produce  the  relatively  efficient  classifier  for  large  training 
sample  sizes  (and  generally  does  so  for  small  training  sample  sizes). 

Human Recognition of the Very Noisy Images

Figure 8.31 replicates the box plots in the left-hand side of figure 8.30 alongside box plots that summarize
the empirical error rates of fifteen human subjects. The human subjects were asked to classify the forty
images of figure 8.28, and their empirical error rates were computed according to (8.1) for a test sample of 40 examples. Like
the moderately noisy experiment, the very noisy human experiment was not rigorously matched against the
machine experiments, so it was almost surely biased against the human subjects for the reasons described
earlier.

Fifteen subjects classified the 64-pixel images in figure 8.28, and fifteen different subjects classified the
256-pixel “parent” images in figure 8.34. Figure 8.31 (right) shows that the median empirical test sample
error rate for humans was 37% for the compressed images, more than 2½ times the CE-generated logistic
linear classifier’s median rate. As in the moderately noisy experiment, the human subjects’ discriminant
variance was more than an order of magnitude higher than the logistic linear classifiers’ (indicated by the
comparative spans of the box plots for the compressed image experiments). At this point, we encourage the
reader to classify the images in figures 8.28 and 8.34; then determine your empirical error rates by comparing
your classifications with the answers in figures 8.36 and 8.38.

Note  that  the  disparity  between  the  error  rates  of  the  human  subjects  who  classified  the  un-compressed 
noisy  images  and  those  who  classified  the  compressed  images  is  substantially  less  than  it  was  for  the 
moderately  noisy  images  (the  median  empirical  error  rate  was  30%  for  the  un-compressed  high-noise  images, 
versus 37% for the compressed high-noise images; see figure 8.31, far right and right, respectively). Simply
put, when the image SNR is greater than 2 dB, the human subjects distinguish the digits in the un-compressed
noisy  images  more  easily  than  they  can  in  the  compressed  versions.  As  the  SNR  drops  to  1.2  dB,  the 
un-compressed  images  become  nearly  as  hard  to  recognize  as  their  compressed  counterparts. 

We  are  not  qualified  to  ponder  whether  or  not  humans  have  and  use  proper  parametric  models  for 
learning  — a  question  that  goes  well  beyond  our  interest  and  expertise.  We  have  included  these  relatively 
un-controlled  human  experiments  simply  to  illustrate  that  the  differences  among  the  three  logistic  linear 
classifiers  are  insignificant  when  compared  to  the  differences  between  the  machine  and  human  experiments. 
All  the  machine  learning  approaches  out-classified  the  human  subjects  by  a  substantial  margin  when  the 
digits  were  corrupted  by  large  amounts  of  noise.  We  find  this  result  interesting  in  its  implications  for  future 
comparisons  of  synthetic  (i.e.,  machine-based)  and  organic  (e.g.,  human)  learning  systems. 


Learning  and  Recognizing  the  Very  Noisy  Images  with  the  Linear  Hypothesis  Class 

Figure 8.32 summarizes the empirical test sample error rates for classifiers generated from the linear
hypothesis class by differential and probabilistic learning. The statistics are obtained from the same
25 independent randomly-generated partitions of the DB1 database used in the high-noise logistic linear
experiments. Figure 8.32 and table 8.5 show that the differentially-generated linear classifier is more efficient
than its probabilistically-generated counterpart for this high amount of noise corruption (SNR = 1.2 dB).
The average empirical test sample error rate is 15.4% for the differentially-generated classifier, versus
17.1% for the probabilistically-generated (MSE) classifier. Both classifiers exhibit about the same empirical
discriminant variance (approximately $1.6 \times 10^{-4}$). The right-hand side of figure 8.32 shows that differential
learning consistently generates the linear classifier with the lowest empirical test sample error rate.
Probabilistic learning via MSE generates a classifier with an empirical test sample error rate that is typically
about 1.8% higher than (or 1.13 times) that of its differentially-generated counterpart. The MSE-generated
classifier's empirical MSDE is about 2½ times the differentially-generated classifier's. This is reflected in
the differential learning strategy's ERE, which is $\mathrm{RE}[\Lambda_\Delta, \Lambda_{P\text{-MSE}} \mid \{\omega_1, \ldots, \omega_C\}, G(\Theta)] = 0.40$ for the
linear hypothesis class and the 1.2 dB SNR.
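
The EREs quoted in this section can be roughly cross-checked against the reported averages. The sketch below rests on several assumptions that are our reading rather than definitions taken from this section: that a classifier's empirical MSDE is its discriminant variance plus the squared gap between its average error rate and the Bayes error rate, that the ERE of differential learning relative to a probabilistic strategy is the ratio of the two MSDEs, that the estimated Bayes error rates are 5% (3.7 dB) and 12% (1.2 dB), and that the discriminant variances are of order 10^-4. The function names are illustrative.

```python
def msde(mean_error, bayes_error, variance):
    """Assumed empirical MSDE: variance plus squared excess error over the Bayes rate."""
    return variance + (mean_error - bayes_error) ** 2


def ere(diff, prob, bayes_error):
    """Assumed ERE of differential vs. probabilistic learning: ratio of their MSDEs.

    diff and prob are (mean_error, variance) pairs.
    """
    return msde(diff[0], bayes_error, diff[1]) / msde(prob[0], bayes_error, prob[1])


# Average error rates and variances as quoted in the text (variance exponent assumed).
linear_3_7 = ere((0.061, 1.3e-4), (0.097, 1.3e-4), bayes_error=0.05)
logistic_mse_1_2 = ere((0.160, 1.7e-4), (0.147, 1.8e-4), bayes_error=0.12)
logistic_ce_1_2 = ere((0.160, 1.7e-4), (0.144, 1.8e-4), bayes_error=0.12)
linear_1_2 = ere((0.154, 1.6e-4), (0.171, 1.6e-4), bayes_error=0.12)
```

Under these assumptions the ratios land near the quoted values of 0.11, 2.01, 2.28, and 0.40 (the last least closely). Exact agreement is not expected, because the inputs are rounded averages. Ratios below one favor differential learning and ratios above one favor the probabilistic strategy.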

These results stand in contrast to those for the logistic linear classifier (cf. figures 8.30 and 8.32). The
probabilistically-generated linear classifier is consistently worse than its differentially-generated counterpart
because, as in the moderately noisy experiments, the linear hypothesis class remains an improper parametric
model of the noisy, compressed DB1 feature vector for all SNR values. As a result, differential learning's
asymptotic efficiency, which holds for any and all hypothesis classes, generates the relatively efficient linear
classifier for these training sample sizes of n ≈ 600. Again, experiments were not conducted with the
modified Gaussian RBF hypothesis class because it is clearly an improper parametric model of both the
noise-free and noisy digits; differential learning would surely generate the most efficient RBF classifier.


Figure 8.30: Left: Test sample classification summaries for the 650-parameter logistic linear classifier
employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). The summaries are
based on 25 independent trials in which the DB1 database is randomly partitioned into very noisy training
and test samples, each containing approximately 600 examples. The box plots are a non-parametric depiction
of the empirical test sample error rate’s distribution over the 25 trials; the whisker plots depict the average
empirical test sample error rate plus and minus one standard deviation, thereby characterizing each classifier's
MSDE. Right: The increase in the discriminant error of the two probabilistically-generated models over the
differentially-generated model on a trial-by-trial basis. These box plots show that differential learning never
generates the classifier with the lowest empirical error rate. This is because the logistic linear hypothesis class
employing probabilistic learning approximates the proper parametric model of the very noisy feature vector:
the class-conditional pdf for each digit is now dominated by noise (SNR = 1.2 dB) and is approximately
Gaussian distributed, owing to the compression scheme we employ (see sections 8.5.1 and 8.5.2).

Figure 8.31: Left: The test sample classification summaries shown in figure 8.30 for the 650-parameter
(64-pixels/digit) logistic linear classifier employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic
learning ($\Lambda_P$). Right: Classification summaries for fifteen human subjects asked to classify the 40 64-pixel
examples shown in figure 8.28. Far Right: Classification summaries for fifteen different human subjects
asked to classify the 40 256-pixel examples shown in figure 8.34, from which the compressed versions in
figure 8.28 are derived.

Figure 8.32: Left: Test sample classification summaries for the 650-parameter linear classifier employing
differential learning ($\Lambda_\Delta$) and the MSE form of probabilistic learning ($\Lambda_P$). The summaries are based on
25 independent trials in which the DB1 database is randomly partitioned into very noisy training and test
samples, each containing approximately 600 examples. The box plots are a non-parametric depiction of the
empirical test sample error rate's distribution over the 25 trials; the whisker plots depict the average empirical
test sample error rate plus and minus one standard deviation, thereby characterizing each classifier’s MSDE.
Right: The increase in the discriminant error of the probabilistic model over the differentially-generated
model on a trial-by-trial basis. Because the probabilistically-generated linear classifier remains an improper
parametric model of the compressed digits with decreasing SNR, differential learning always generates the
linear classifier with the lowest empirical error rate. Although differential learning yields the relatively
efficient classifier for the linear hypothesis class, it does not generate the efficient classifier. The efficient
classifier is approximated by the logistic linear hypothesis class/probabilistic learning (Kullback-Leibler)
combination, a proper parametric model approximation that exhibits a slightly lower average empirical error
rate (cf. figure 8.30).


Finally, note that the test sample error rates, empirical MSDE, etc. for the differentially-generated linear
classifier are all slightly lower (i.e., better) than those for the differentially-generated logistic linear classifier
(cf. figures 8.30 and 8.32). This phenomenon is at odds with the trends we see in the noise-free experiments.
We suspect (but have not substantiated) that it might relate to the issue of model complexity versus training
sample size. That is, the functional complexity of the logistic linear hypothesis class might be somewhat
higher than that of the linear hypothesis class. If this is true, then the logistic linear classifier might have
excessive functional complexity for a small training sample of very noisy digits, resulting in its slightly
higher empirical MSDE.

8.6  Summary 

The AT&T DB1 optical character recognition (OCR) task serves to illustrate the theoretical findings of
part I. We have shown that compressing the digit images allows us to reduce the complexity of the classifiers
we employ in the OCR task. By lowering the classifier complexity we improve the generalization of all
classifiers, but the improvement is most pronounced for the differentially-generated classifiers. This is
because differential learning is guaranteed to generate the relatively efficient classifier, regardless of the
choice of hypothesis class. Although large training sample sizes are necessary to guarantee that differential
learning will be efficient, the trait generally holds for small training sample sizes as well (i.e., as long as the
hypothesis class is an improper parametric model of the data).

Learning to recognize noisy versions of the compressed DB1 digits provides us with the conditions under
which differential learning is not the most efficient learning strategy for small training sample sizes. Because
the very noisy compressed digits are nearly Gaussian distributed with homoscedastic class-conditional
pdfs, the CE-generated logistic linear classifier constitutes a proper parametric model of the noisy data;
probabilistic learning is therefore the more efficient learning strategy. As we have mentioned a number of
times already, there are well-known procedures for assessing whether or not the hypothesis class is a proper
parametric model of the feature vector, so it is relatively straightforward to detect the circumstances under
which differential learning might not be the most efficient strategy. Absent a strong indication to the contrary,
differential learning is the prudent, efficient choice.

Readers familiar with Geman, Bienenstock, and Doursat's lovely paper on regression and classification
with neural networks [41] (we will refer to the authors as “GBD” henceforth) will recognize our use of
their noise protocol to corrupt the DB1 digits. Those readers will also note that our noise-free average
empirical error rates of 2% and very noisy rates of 14.5% are substantially lower than GBD’s, which were
approximately 16% and 40%, respectively.¹² Part of the disparity surely stems from their using training
sample sizes of only 200; ours were 600. However, we conclude by virtue of our own control experiments
with probabilistic learning that much of the disparity stems from their use of probabilistically-generated
high-complexity classifiers. Indeed, their typical MSE-generated classifier had O(6700) parameters, whereas
ours have only 650 parameters. To be sure, our simple classifiers are generally incapable of modeling the
DB1 data with low functional error, but our objective is merely to model the data with low discriminant error,
an objective that we achieve consistently well with differentially-generated low-complexity classifiers.
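
The two parameter counts compared above can be reproduced with simple arithmetic. The sketch below assumes ten digit classes, fully connected layers, and one bias per unit; these are our assumptions, chosen because they agree with the 650 parameters quoted here and with the 6685 parameters given for GBD's 25-hidden-unit network in the accompanying footnote. The function names are hypothetical.

```python
def linear_param_count(n_inputs, n_classes):
    """One weight per input plus one bias, for each class output."""
    return (n_inputs + 1) * n_classes


def mlp_param_count(n_inputs, n_hidden, n_classes):
    """Fully connected single-hidden-layer network with biases on both layers."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_classes


ours = linear_param_count(64, 10)       # 64-pixel compressed images
gbd = mlp_param_count(256, 25, 10)      # 256-pixel images, 25 hidden units
ratio = gbd / ours                      # roughly a tenfold difference
```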

Our results lead us to dispute GBD’s implication that a “good representation”, determined a priori
by careful engineering, is a prerequisite for good generalization when the ultimate objective is pattern
classification.¹³ Instead, we argue that an efficient learning strategy is the most important prerequisite
for good generalization. Given this, one can employ simple, potentially improper parametric models (i.e.,
potentially poor representations, as defined by GBD), yet achieve robust generalization. This is a fact
established theoretically by the proofs of part I and clearly illustrated by the DB1 experiments of this chapter. Indeed, the robust beauty
of differentially-generated models is that they need not be proper to yield robust generalization. Thus, one's
choice of representation is no longer a critical factor on which the classifier's ability or failure to discriminate

¹² Our error rates are for the 650-parameter logistic linear classifier, which is the simplest form of multi-layer perceptron (MLP), one
having no hidden layer units. The logistic linear classifier recognizes 64-pixel images. GBD’s error rates are for an MLP with 25 hidden
layer units, a substantially more complex hypothesis class possessing 6685 parameters. Their MLP recognized 256-pixel images.

¹³ We wholeheartedly support their implication when the objective is function approximation, as it is in regression tasks. We remind
the reader, however, that regression and pattern classification are two very different tasks.

Figure 8.33: Moderately noisy 256-pixel digits formed by flipping the binary pixels of the original DB1
database with probability 0.1. This form of artificial noise corruption is described in [41]. The signal-to-noise
ratio (SNR) of these images is 3.7 dB. The digits shown in figure 8.23 are generated by linearly compressing
these images. The correct labels for the digits are shown in figure 8.37. We encourage the reader to classify
these images prior to looking at the correct labels.

Figure 8.34: Very noisy 256-pixel digits formed by flipping the binary pixels of the original DB1 database
with probability 0.2. The signal-to-noise ratio (SNR) of these images is 1.2 dB. The digits shown in figure 8.28
are generated by linearly compressing these images. The correct labels for the digits are shown in figure 8.38.
We encourage the reader to classify these images prior to looking at the correct labels.

Figure 8.35: Correct labels for the digits in figure 8.23.

Figure 8.36: Correct labels for the digits in figure 8.28.

Figure 8.37: Correct labels for the digits in figure 8.33.

Figure 8.38: Correct labels for the digits in figure 8.34.


Chapter  9 

Medical Diagnosis with Differential Learning¹


Outline 

We use a simple logistic linear classifier employing differential learning to diagnose avascular necrosis
(AVN) of the femoral head, a potentially crippling hip joint disorder. The diagnosis is rendered from
magnetic resonance images. Specifically, we repeat the experiments of Manduca, Christy, and Ehman
[90]; we compare the diagnostic accuracy of a differentially-generated logistic linear classifier and two
probabilistically-generated controls (including the logistic regression model) with their original results.
When presented with approximately sixty training images and subsequently evaluated on the same number
of test images, the differentially-generated logistic linear classifier discriminates between healthy and
AVN-compromised femoral heads with a 5.9% error rate. This error rate is slightly lower than the 7.5% error rate
of humans without formal training in radiology, reported in [90]. The differentially-generated logistic linear
classifier generalizes better than the probabilistic controls and the best previous machine model, a multi-layer
perceptron having approximately 24 times the number of parameters (6,164, versus 257 for the logistic linear
classifier).

9.1  Introduction 

Avascular necrosis of the femoral head is a disease in which the blood supply to the head of the femur is
restricted, causing loss of bone marrow. Manduca, Christy, and Ehman have used neural network classifiers
to detect the presence of this disorder from magnetic resonance images (MRIs) of 40 adult patients [90]. The
image database they generated for the task contains 125 images, 51% of which indicate the presence of AVN.
Details of the database are given in [90]. Figure 9.1 shows fourteen examples of these MRI images: each
image is a 1024-pixel (32 × 32) image, and the pixels have four bits of precision (i.e., they have 16 possible
values: dark image pixels represent negative values, and light pixels represent positive values). The image is
of the femoral head, the ball at the top of the femur that mates with the hip socket. As figure 9.1 illustrates,
the presence of AVN is manifest as dark regions in the light-shaded, oval-shaped bone mass. These dark
regions indicate the absence of water-containing bone marrow and fat, which is conclusive evidence of AVN. The
dark annular region surrounding the marrow and fat is cortical bone (i.e., the bone's hard outer surface). The
surrounding light material is fluid in the joint space; the surrounding dark material is generally cartilage.² It
is obvious from these images that there is some variance in the size of the femoral head among subjects. As a
result, a correct diagnosis in part hinges on being able to distinguish the cortical bone and cartilage (present
in all of the images) from AVN sites.

¹ We thank Dr. Armando Manduca of the Mayo Clinic/Foundation for providing us with the magnetic resonance image data for this
task along with statistics and insights from the original experiments in [90].

We begin by learning all 125 images with a simple classifier possessing the single logistic linear
discriminant function (section 7.2.2)³

$$ g_1(\mathbf{X} \mid \boldsymbol{\theta}) = \left[ 1 + \exp\left( -{\mathbf{X}'}^{\mathsf{T}} \boldsymbol{\theta} \right) \right]^{-1} , \qquad (9.1) $$

where $\mathbf{X}$ is the $N = 1024$ pixel image vector, $\mathbf{X}'$ is the $N + 1 = 1025$ dimensional augmented feature
vector formed by adding a single element of unit value to $\mathbf{X}$ (see (7.2)), and $\boldsymbol{\theta} \in \Theta = \mathbb{R}^{N+1}$ is the
$(N+1)$-dimensional parameter or weight vector. The single discriminant function $g_1(\mathbf{X} \mid \boldsymbol{\theta}) \in [0, 1]$ is
associated with a healthy diagnosis. That is, when $g_1(\mathbf{X} \mid \boldsymbol{\theta}) > 0.5$, the classifier makes a healthy diagnosis;
when $g_1(\mathbf{X} \mid \boldsymbol{\theta}) < 0.5$ the classifier makes an AVN diagnosis. We can cast this single output classifier
into the canonical form described in section 2.2.1 simply by imagining a second discriminant function
$g_2(\mathbf{X} \mid \boldsymbol{\theta}) = 1 - g_1(\mathbf{X} \mid \boldsymbol{\theta})$ associated with an AVN diagnosis.
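The discriminant function (9.1) and its decision rule can be sketched in a few lines of Python. This is a minimal illustration only: the function names `augment`, `g1`, and `diagnose` are hypothetical, the weights in the usage note are made-up toy values rather than learned parameters, and the treatment of an output of exactly 0.5 (which the text leaves unspecified) is an arbitrary choice.

```python
import math


def augment(x):
    """Form the augmented vector X' by prepending a unit-value bias element to X."""
    return [1.0] + list(x)


def g1(x, theta):
    """Logistic linear discriminant g1(X | theta) = 1 / (1 + exp(-X'^T theta))."""
    x_aug = augment(x)
    if len(x_aug) != len(theta):
        raise ValueError("theta must have N + 1 elements")
    activation = sum(xi * ti for xi, ti in zip(x_aug, theta))
    # Numerically stable evaluation of the logistic function.
    if activation >= 0.0:
        return 1.0 / (1.0 + math.exp(-activation))
    e = math.exp(activation)
    return e / (1.0 + e)


def diagnose(x, theta):
    """Healthy when g1 > 0.5, AVN otherwise (ties go to AVN: an arbitrary choice)."""
    return "healthy" if g1(x, theta) > 0.5 else "AVN"
```

For a toy four-pixel "image" with weights `theta = [0.0, 1.0, -1.0, 0.5, 0.0]`, the call `diagnose([2, 1, 0, 3], theta)` evaluates the inner product to 1 and returns a healthy diagnosis; the phantom second output is simply `1 - g1(x, theta)`.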

Figure 9.2 (left) illustrates the parameters (or weights) formed when this logistic linear classifier learns to
diagnose all 125 images correctly using differential learning. Dark pixels in the left display represent negative
weights, and light pixels represent positive weights. The far-left column of the left display contains only
one vertically-centered pixel. This pixel represents the "bias" parameter corresponding to the unit-value
element prepended to $\mathbf{X}$ in order to form the augmented feature vector $\mathbf{X}'$ of (7.2). The gray shade
of the far-left pixel column represents the value zero (for reference). The lighter weights have positive
values and correlate with dark AVN-compromised regions in the MRI images. The darker weights have
negative values and correlate with dark regions in the MRI images that are common to healthy examples
(e.g., cortical bone and cartilage). A visual comparison of figure 9.2 (left) with the AVN-compromised
examples in figure 9.1 confirms these relationships. One notable aspect of the 1025 weight image is its low

² We thank Dr. Martha McDaniel, M.D., of the VA Medical Center, White River Junction, VT, and the Dartmouth-Hitchcock Medical
Center, Lebanon, NH, for her primer on femoral head MRI interpretation. In truth, the process of image interpretation is complex,
performed by experts with extensive training. In addition, the spin sequence used in generating the MRI has a significant impact on how
the image is interpreted. Our description of the image composition is therefore a general one.

³ We use a single discriminant function for this C = 2-class problem because it has one-half the number of parameters a classifier
with two discriminant functions would have. Excess complexity is anathema.

Figure 9.1: Fourteen 1024-pixel examples of healthy (bottom row) and AVN-compromised (top row) femoral
heads. The number below each image is its index number in the 125-image database described in [90] (our
indices run from 0 to 124).


Figure 9.2: Left: The parameters of a 1024-pixel logistic linear classifier, obtained by differentially learning
all 125 example images. Light parameters (or weights) are positive and detect AVN-related dark regions in
the image; dark weights are negative and detect dark regions in the image (cortical bone and cartilage) not
associated with AVN. Weight smoothing (described in section M.2) is applied during learning to minimize
the parametric entropy (definition M.1) of the weights (i.e., the amount of information they store). This in
turn reduces discriminant variance across learning trials. Right: A histogram of the weights in the left figure.
Note the parametric entropy of the weights is 2.2, reflecting the low variance in the distribution of weights
caused by smoothing. This low variance/parametric entropy accounts for the low contrast in the weight
display on the left.

contrast; the gray scale of weights changes abruptly in only a few regions of the display. This is because
the classifier employs weight smoothing (described in section M.2) during learning in order to minimize
the parametric entropy (definition M.1) of its weight vector, an empirical measure that we use to gauge the
weight vector's information content. Weight smoothing reduces the classifier's discriminant variance, since
only the information essential to learning is retained. Because the classifier learns all 125 examples with a
moderate amount of weight smoothing ($\kappa = 0.05$),⁴ its weight vector's parametric entropy is a relatively
low 2.16, which reflects the low variance in the histogram of the weights (figure 9.2, right).

These factors indicate that the effective number of classifier parameters [97] is much lower than 1025.
Furthermore, they suggest that the degree of image resolution necessary for a correct diagnosis is relatively
low. As a result, we generate a database of linearly compressed 256-pixel (16 × 16) images from the
original database of 1024-pixel (32 × 32) images. The linear lossy compression algorithm is described in
section M.3. Figure 9.3 shows the compressed versions of the MRIs in figure 9.1. Figure 9.4 (left) illustrates
the weights formed when a 257-parameter⁵ logistic linear classifier learns to diagnose all 125 compressed
images correctly using differential learning. This classifier also employs weight smoothing ($\kappa = 0.02$).
As with the higher-resolution classifier, the lighter weights have positive values that correlate with dark
AVN-compromised regions in the MRI images, and the darker weights have negative values that correlate
with dark regions in the MRI images that are common to healthy examples. Figure 9.4 (right) indicates that
the 257-element parameter vector has higher parametric entropy than its 1025-element counterpart. This is
reflected in the increased variance of the 257 weights' histogram and suggests that the average weight in
the lower complexity classifier encodes more discriminant information than the average weight in the higher
complexity classifier.
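To make the 4:1 reduction from 32 × 32 to 16 × 16 pixels concrete, the sketch below averages non-overlapping 2 × 2 pixel blocks. Block averaging is only a stand-in chosen here because it is a simple linear, lossy 4:1 compression; the algorithm actually used in this work is the one described in section M.3, which may differ. The function name `block_average` is hypothetical.

```python
def block_average(image, factor=2):
    """Average non-overlapping factor x factor blocks of a square image.

    `image` is a list of equal-length rows. This is an assumed stand-in for
    the linear compression of section M.3, not a reproduction of it.
    """
    size = len(image)
    if any(len(row) != size for row in image) or size % factor != 0:
        raise ValueError("image must be square with side divisible by factor")
    compressed = []
    for r in range(0, size, factor):
        out_row = []
        for c in range(0, size, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) / len(block))
        compressed.append(out_row)
    return compressed
```

Applied to a 32 × 32 image, this returns a 16 × 16 image, so the augmented feature vector has 256 + 1 = 257 elements, matching the parameter count of the low-complexity classifier.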

The lower resolution MRIs are definitely harder to learn than the high-resolution images. The 1025-parameter
classifier learns all the examples with $\psi = 0.3$, whereas the 257-parameter classifier must reduce
$\psi$ to 0.07 before it can learn all 125 examples. Figure 9.5 shows the final state of the low-complexity
(257-parameter) classifier after learning. All examples lie on the reduced discriminant continuum (i.e.,
the line between the upper left and lower right corners of reduced discriminator output space) because the
classifier has only one output; our earlier specification of the phantom second output ensures this. Note that
the hardest examples (including examples 48 and 49) engender very small positive discriminant differentials;
the classifier has low confidence in its diagnosis of these examples.


⁴ The weight smoothing parameter $\kappa$ has a value between zero and one (see appendix M). A value of zero results in no smoothing;
a value of one forces all weights to have the same value. From a qualitative perspective, any value of $\kappa > 0.1$ is large, any value of
$\kappa > 0.01$ is moderate, and any value of $\kappa < 0.01$ is small.

⁵ There is one discriminant function, and the augmented compressed feature vector has $N + 1 = 257$ elements. Therefore the
classifier has 257 total parameters.

Figure 9.3: The images in figure 9.1, linearly compressed to 256 pixels.


Figure 9.4: Left: The parameters of a 256-pixel differentially-generated logistic linear classifier, obtained by
learning all 125 example images. Weight smoothing is applied during learning to reduce discriminant variance
across learning trials. Right: A histogram of the weights in the left figure. Note the parametric entropy of
the weights is 2.9, reflecting a moderate increase in the weight distribution's variance (cf. figure 9.2). This
increased variance/parametric entropy accounts for the slightly increased contrast in the lower-resolution
weight display.

Figure 9.5: The 257-parameter differentially-generated logistic linear classifier's output state, as projected
onto the reduced discriminant continuum, after learning all 125 AVN examples. This output state
corresponds to the parameters shown in figure 9.4 (right). Note that the low CFM confidence parameter
$\psi = 0.077$ (resulting in a steep sigmoidal form of CFM) is necessary to learn examples 48 and 49 (see
figure 9.8). These examples generate very small positive discriminant differentials, corresponding to the
classifier's low confidence in its diagnosis.

9.2 Recognition Results

In order to assess how well the low-complexity logistic linear classifier generalizes, we repeat an experiment
described in [90] in which the 125 example database is randomly partitioned into training and test samples of
approximately equal sizes. Specifically, we run 55 2-fold cross validation trials; in each trial, each example
is randomly assigned to the training sample with probability $\frac{1}{2}$; those examples not assigned to the training
sample constitute the test sample. The classifier learns for 125 epochs or for 5 epochs beyond the point at
which all training examples are classified correctly, whichever is less. All the trials are conducted according
to the general protocols set forth in section 8.2. The confidence parameter is set to $\psi = 1$, and the weight
smoothing factor is $\kappa = 0.02$. These measures are taken to reduce the classifier's empirical discriminant
variance across trials: training sample sizes average 62 examples, a small number even for the compressed
images; as a result, we want the classifier to learn only those examples in which it has high confidence.
Figure 9.6 illustrates the final reduced discriminator output state after a typical learning trial for which the
training sample size is 58 and the test sample size is 67. Training examples are shown as dark gray dots and
test examples are shown as black triangles. The classifier learns all the training examples (most with high
confidence), but misclassifies three test examples.
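The partitioning and stopping rules just described can be sketched as follows. This is an illustrative reading of the protocol, not the original experimental code: the names `random_split`, `run_trials`, and `epochs_to_run` are hypothetical, the seed is arbitrary, and the full protocol of section 8.2 is not reproduced.

```python
import random


def random_split(n_examples, p_train=0.5, rng=None):
    """Assign each example index to the training sample with probability p_train;
    the remaining indices form the test sample."""
    rng = rng or random.Random()
    train, test = [], []
    for index in range(n_examples):
        (train if rng.random() < p_train else test).append(index)
    return train, test


def run_trials(n_trials=55, n_examples=125, seed=0):
    """Generate the train/test partitions for a series of independent trials."""
    rng = random.Random(seed)  # arbitrary seed, for repeatability only
    return [random_split(n_examples, 0.5, rng) for _ in range(n_trials)]


def epochs_to_run(first_all_correct_epoch, max_epochs=125, extra=5):
    """Learn for max_epochs, or for `extra` epochs beyond the first epoch at which
    all training examples are classified correctly, whichever is less."""
    if first_all_correct_epoch is None:  # training sample never fully learned
        return max_epochs
    return min(max_epochs, first_all_correct_epoch + extra)
```

Because each example is assigned independently, the training sample size varies from trial to trial around its expected value of 62.5, which is why a typical trial can have 58 training and 67 test examples.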

From a diagnostic perspective, the null hypothesis is that the image is of a healthy femur; the alternative
hypothesis is that the image is of an AVN-compromised femur. The classifier's sensitivity to the alternative
hypothesis is the probability that it will detect AVN when it is indeed present. The classifier's specificity
is the probability that it will not incorrectly classify a healthy image as AVN-compromised. Figure 9.7
(left) shows the estimated sensitivity and specificity (with 95% confidence bounds) for the classifier in
figure 9.6. This classifier's sensitivity is 91.4 (+8.6/-11.7)%, so it fails to detect 8.6 (+11.7/-8.6)% of the
AVN-compromised test examples. The classifier's specificity is 100 (+0.0/-8.9)%, so it never incorrectly
classifies a healthy test example as AVN-compromised (modulo the confidence bound). Figure 9.7 (right)
shows the receiver operating characteristic (ROC) [134, sec. 2.2.2] for the classifier's ability to detect AVN.
The discontinuities in the ROC are due to the small test sample size (67) on which it is based. Taken in the
context of this small sample size and the large confidence bounds on the sensitivity/specificity statistics, the
ROC power of 0.93 cannot be viewed as anything more than a gross measure of the classifier's detection
capabilities. That is, we know that the classifier has reasonably good AVN detection characteristics, but there
is insufficient data to state specifically how good these characteristics are.
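The point estimates quoted above follow directly from the test-sample counts. In the sketch below the counts are an inference, not figures stated in the text: three missed AVN cases out of 67 test examples, together with a miss rate of 8.6% and a specificity of 100%, are consistent with 35 AVN-compromised and 32 healthy test images. The confidence bounds (computed as described in [62]) are not reproduced here.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of AVN-compromised examples that are detected."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of healthy examples that are not flagged as AVN-compromised."""
    return true_neg / (true_neg + false_pos)


# Counts inferred from the reported percentages (an assumption, see above).
TRUE_POS, FALSE_NEG = 32, 3   # AVN test examples: detected, missed
TRUE_NEG, FALSE_POS = 32, 0   # healthy test examples: cleared, false alarms

sens = sensitivity(TRUE_POS, FALSE_NEG)
spec = specificity(TRUE_NEG, FALSE_POS)
print(f"sensitivity = {100 * sens:.1f}%, specificity = {100 * spec:.1f}%")
```

With so few test examples, a single additional miss would move the sensitivity estimate by almost three percentage points, which is one way to see why the confidence bounds are so wide.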

One fact is certain from the trials: the information loss due to image compression has an impact on
the classifier's ability to detect AVN. Figure 9.8 illustrates why. The figure shows the original 1024-pixel
images (top) of the three examples misclassified by the classifier in figure 9.6. The three images below the
high-resolution ones are their compressed versions. After compression it is difficult to discern the signs of
AVN in examples 37 and 48; example 49 looks compromised to the human eye, but the low complexity

Figure 9.6: The 257-parameter differentially-generated logistic linear classifier's output state, as projected
onto the reduced discriminant continuum, after a typical learning trial for which the training sample size is
58 images selected randomly from the pool of 125 total images. The model learns all training examples, but
misclassifies three of the 67 test examples: 37, 48, and 49 (see figure 9.8). These misclassifications appear
on the "incorrect" side of the reduced discriminant boundary; their discriminant differentials are -0.89, -0.99,
and -0.99, respectively. The large negative differentials indicate that the classifier is relatively confident in its
incorrect diagnosis, based on the 58 training examples it has learned.

Figure 9.7: Left: The differentially-generated logistic linear classifier's sensitivity and specificity for the
trial depicted in figure 9.6; 95% confidence bounds are given for $\alpha$ and $\beta$, and are computed as described
in [62]. Right: The associated receiver operating characteristic for detecting an AVN-compromised femoral
head (based on the test sample for this trial: middle row of the table on the left).


(i.e., 257-parameter) logistic linear classifier consistently evaluates this example as healthy. In reducing
the classifier complexity by a 4:1 compression of the feature vector's dimensionality, we reduce the
classifier's discriminant variance. This ensures that the classifier's error rate remains reasonably consistent
across independently drawn training/test samples. The price we pay for this decreased discriminant variance
is an increase in discriminant bias: some examples (e.g., 37, 48, and 49) become difficult to classify at lower
resolution.

This  reveals  an  interesting  manifestation  of  the  bias/variance  tradeoff  common  to  all  estimators.  In  the 
case  of  AVN  diagnosis  from  MRI  images,  the  nature  of  the  disease  sometimes  requires  high-resolution 
imagery for a correct diagnosis (i.e., for low discriminant bias). This requirement, in turn, dictates higher
classifier  complexity.  The  increased  complexity  demands  more  data  to  ensure  that  the  classifier’s  discriminant 
variance  is  low  (i.e.,  that  the  more  complex  classifier  will  be  sure  to  make  consistent  diagnoses). 

Figure 9.9 summarizes the results of the 55 2-fold cross validation trials run on the compressed AVN
images. The left side of the figure compares differential learning with two forms of probabilistic learning: the
257-parameter logistic linear classifier is used in each case, and all aspects of learning are identical except for
the objective function (learning strategy) employed (see section 8.2). The classifier employing differential
learning ($\Lambda_\Delta$) has an average empirical test sample error rate of 5.9%, compared with 7.4% and 7.9% for the
MSE and CE-generated variants. The classifier's empirical discriminant variance is approximately $8 \times 10^{-4}$




Figure 9.8: The three test images that the differentially-generated logistic linear classifier of figure 9.6
misclassifies. The classifier is presented the 256-pixel images in the bottom row of the figure: for these
particular examples, compression from 1024 to 256 pixels results in the loss of information critical to a
correct classification. In fact, the low-complexity linear classifier consistently misclassifies examples 48 and
49; it misclassifies example 37 in many of the 55 2-fold cross validation trials.


with differential learning and $9 \times 10^{-4}$ with probabilistic learning. The right side of the figure compares
the probabilistically generated classifiers with the differentially-generated one on a trial-by-trial basis. On
average, the MSE-generated classifier's empirical test sample error rate is 1.4% greater than (or 1.2 times)
the differentially-generated classifier's, with upper and lower standard deviations as shown. In half of the
trials the differentially-generated classifier does no better than the MSE-generated classifier; in three trials it
does worse (by as much as 1.9%); in the remaining trials it does better (by as much as 9.0%). On average,
the Kullback-Leibler (CE)-generated classifier's empirical test sample error rate is 1.9% greater than (or 1.3
times) the differentially-generated classifier's, with upper and lower standard deviations as shown. In one
quarter of the trials the differentially-generated classifier does no better than the CE-generated classifier; in
four trials it does worse (by as much as 5.2%); in the remaining trials it does better (by as much as 8.1%).

We surmise that the differentially-generated model is not always better than its probabilistic counterparts
due to its high complexity, given the average training sample size of $n = 62$; even the "low-complexity"
classifier has a large number of parameters. As a result, all of the models exhibit high discriminant variance.
Since differential learning guarantees asymptotic efficiency only, $n$ is not big enough for the differentially-


Figure 9.9: Left: Test sample classification summaries for the 257-parameter logistic linear classifier
employing differential learning ($\Lambda_\Delta$) and two forms of probabilistic learning ($\Lambda_P$). The summaries are
based on 55 independent trials. In each trial, training examples are drawn randomly from the set of 125
images with probability $\frac{1}{2}$; those not chosen for training form the test sample. Note that the CE-generated
logistic linear classifier is identically the logistic regression model for this classification task (see appendix E).
Right: The increase in the discriminant error of the two probabilistic models over the differentially-generated
model on a trial-by-trial basis.

Figure 9.10: Test sample classification summaries for the low-complexity (257-parameter)
differentially-generated logistic linear classifier (far left), Manduca, Christy, and Ehman's [90] 2050-parameter
MSE-generated logistic linear classifier (middle left), and their best MSE-generated non-linear classifier (middle
right), a high-complexity multi-layer perceptron with 6 hidden units and 6,164 parameters. Note that the
differentially-generated model's average estimated discriminant error is the same as the high-complexity
probabilistic model's, but its estimated discriminant variance is one-half that of the high-complexity model
(cf. table 9.1). Each of ten human subjects performed one 2-fold cross validation trial (far right); the
differentially-generated logistic linear classifier compares favorably with the humans.

Estimated DBias, DVar, and MSDE

                                                  Learning Strategy
Hypothesis Class      Statistic   $\Lambda_\Delta$ (CFM)       $\Lambda_P$ (MSE)            $\Lambda_P$ (CE)

55 trials, 256-pixel images
Logistic Linear       DBias       $5.9 \times 10^{-2}$    $7.4 \times 10^{-2}$    $7.9 \times 10^{-2}$
(257 parameters)      DVar        $7.7 \times 10^{-4}$    $8.8 \times 10^{-4}$    $9.1 \times 10^{-4}$
                      MSDE        $4.3 \times 10^{-3}$    $6.3 \times 10^{-3}$    $7.2 \times 10^{-3}$
                      ERE         --                      0.68                    0.60

Manduca, Christy, and Ehman [90]; 24 trials, 1024-pixel imagesᵃ
Logistic Linear       DBias       --                      $8.4 \times 10^{-2}$    --
(2050 parameters)     DVar        --                      $1.2 \times 10^{-3}$    --
                      MSDE        --                      $8.3 \times 10^{-3}$    --
MLP                   DBias       --                      $6.2 \times 10^{-2}$    --
(6164 parameters)     DVar        --                      $1.7 \times 10^{-3}$    --
                      MSDE        --                      $5.6 \times 10^{-3}$    --

Manduca, Christy, and Ehman [90]; 10 human subjects, 1 trial eachᵃ
Human                 DBias       --                      $7.5 \times 10^{-2}$    --
(10 subjects)         DVar        --                      $4.6 \times 10^{-4}$    --
                      MSDE        --                      $6.1 \times 10^{-3}$    --

Table 9.1: Estimated discriminant bias, discriminant variance, and MSDE for the 257-parameter logistic
linear classifiers generated by differential learning ($\Lambda_\Delta$) via the CFM objective function and probabilistic
learning ($\Lambda_P$) via the MSE and CE objective functions. Estimates are based on 55 2-fold cross validation
trials in which the AVN database is randomly partitioned into training and test samples, each containing
approximately 62 examples. Estimates are also shown for Manduca, Christy, and Ehman's 2050-parameter
logistic linear and 6164-parameter multi-layer perceptron (MLP) classifiers, both generated probabilistically
with the MSE objective function [90]. Those estimates are based on 24 2-fold cross validation trials; the
training/test sample partitions for their trials are different from ours. Finally, estimates are shown for 10
human subjects [90]; each performed one 2-fold cross validation trial.


ᵃ Results from Manduca, Christy, and Ehman [90] used with permission.

generated model to exhibit the lowest empirical error rate in all trials. By making the feature vector more
Gaussian-like, compressing the MRI images makes the logistic linear classifier a better approximation to the
proper parametric model of the feature vector. This also diminishes the relative efficiency of differential
learning vis-a-vis probabilistic learning. Nevertheless, the differentially-generated model is rarely worse, and
on average better than its probabilistically-generated counterparts. In short, its empirical MSDE is lower.
Table 9.1 shows that the differentially-generated 257-parameter logistic linear classifier's ERE (definition 8.6,
page 255) is $\mathrm{RE}\left[ \Lambda_\Delta , \Lambda_{P\text{-MSE}} \mid \{ n_1, \ldots \}, G(\Theta) \right] = 0.68$ versus probabilistic learning via MSE and
$\mathrm{RE}\left[ \Lambda_\Delta , \Lambda_{P\text{-CE}} \mid \{ n_1, \ldots \}, G(\Theta) \right] = 0.60$ versus probabilistic learning via CE. The comparative
efficiency of differential learning stems mainly from its lower discriminant bias (5.9% for differential
learning, versus 7.4% (MSE) and 7.9% (CE) for probabilistic learning).⁶
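The entries of table 9.1 can be checked against one another with a few lines of arithmetic. The sketch below rests on two assumptions that are inferred from the tabulated numbers rather than quoted from the formal definitions (definition 8.6 and the earlier chapters): that MSDE equals the squared discriminant bias plus the discriminant variance, and that ERE is the ratio of the differential MSDE to the probabilistic MSDE. The powers of ten are illegible in the scanned table; the values used here ($10^{-2}$ for DBias, $10^{-4}$ or $10^{-3}$ for DVar, $10^{-3}$ for MSDE) are the reading under which every row is internally consistent.

```python
# (DBias, DVar, MSDE) per table row; exponents are a reconstructed reading.
TABLE_9_1 = {
    "CFM, 257 parameters": (5.9e-2, 7.7e-4, 4.3e-3),
    "MSE, 257 parameters": (7.4e-2, 8.8e-4, 6.3e-3),
    "CE, 257 parameters": (7.9e-2, 9.1e-4, 7.2e-3),
    "MSE linear, 2050 parameters": (8.4e-2, 1.2e-3, 8.3e-3),
    "MSE MLP, 6164 parameters": (6.2e-2, 1.7e-3, 5.6e-3),
    "Human novices": (7.5e-2, 4.6e-4, 6.1e-3),
}


def msde_from_parts(dbias, dvar):
    """Assumed decomposition: mean-squared discriminant error = bias^2 + variance."""
    return dbias ** 2 + dvar


def ere(msde_differential, msde_probabilistic):
    """Assumed form of the empirical relative efficiency: a ratio of MSDEs."""
    return msde_differential / msde_probabilistic


for name, (dbias, dvar, msde) in TABLE_9_1.items():
    # Each recomputed MSDE should match the tabulated one to within rounding.
    print(f"{name}: recomputed {msde_from_parts(dbias, dvar):.2e}, tabulated {msde:.1e}")

ere_mse = ere(TABLE_9_1["CFM, 257 parameters"][2], TABLE_9_1["MSE, 257 parameters"][2])
ere_ce = ere(TABLE_9_1["CFM, 257 parameters"][2], TABLE_9_1["CE, 257 parameters"][2])
print(f"ERE vs MSE = {ere_mse:.2f}, ERE vs CE = {ere_ce:.2f}")
```

In every row the squared bias is several times larger than the variance, which is the numerical content of the statement that the efficiency advantage stems mainly from lower discriminant bias.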

Finally, we compare our results for these 55 trials with Manduca, Christy, and Ehman's results for 24 trials
in which the data is split into training and test samples as described above [90]. Figure 9.10 summarizes the
overall comparison (results cannot be compared on a trial-by-trial basis, because we use different randomly
partitioned training/test samples). Manduca, Christy, and Ehman's logistic linear classifier uses the original
1024-pixel images and has an output for each class, so it has 2050 parameters. It learns probabilistically
with the MSE objective function and a conjugate gradient search algorithm. Its average empirical test sample
error rate⁷ (and, as a result, its empirical discriminant bias) is 8.4%; its empirical discriminant variance is
$1.2 \times 10^{-3}$. Their best multi-layer perceptron (MLP) classifier for the 50/50 training/test sample splits has
6 hidden units, two output units, and uses the original 1024-pixel images. As a result, the model has 6,164
parameters. Like its logistic linear counterpart, it learns probabilistically with the MSE objective function and
a conjugate gradient search algorithm. Its average empirical test sample error rate (and empirical discriminant
bias) is 6.2%; its empirical discriminant variance is $1.7 \times 10^{-3}$. As described earlier, our logistic linear
classifier has 257 parameters; it learns with the CFM objective function and a simple gradient ascent search
algorithm. Its average empirical test sample error rate (and empirical discriminant bias) is 5.9%; its empirical
discriminant variance is $7.7 \times 10^{-4}$.
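The parameter counts quoted for the three models can be reproduced by simple counting. The sketch below assumes fully connected layers with one bias weight per unit; that architecture is an assumption, not something stated in the text, but it yields exactly the reported totals. The function names are hypothetical.

```python
def linear_params(n_inputs, n_outputs):
    """One weight per input plus one bias, for each output unit."""
    return (n_inputs + 1) * n_outputs


def mlp_params(n_inputs, n_hidden, n_outputs):
    """Single-hidden-layer perceptron, fully connected, with a bias per unit."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_outputs


print(linear_params(256, 1))       # single-output classifier on 16 x 16 images
print(linear_params(1024, 2))      # two-output linear classifier on 32 x 32 images
print(mlp_params(1024, 6, 2))      # 6 hidden units, two outputs, 32 x 32 images
```

The ratio of the last count to the first is just under 24, in line with the outline's statement that the multi-layer perceptron has approximately 24 times as many parameters as the low-complexity classifier.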

Compared  with  Manduca,  Christy,  and  Ehman’s  linear  classifier,  our  differentially-generated  model 
has  a  lower  average  empirical  test  sample  error  rate,  and  somewhat  lower  empirical  discriminant  variance. 
Compared  with  Manduca,  Christy,  and  Ehman’s  best  non-linear  classifier,  our  differentially-generated  logistic 
linear  model  has  the  same  empirical  test  sample  error  rate  (our  5.9%  is  not  significantly  better  than  their 
6.2%)  and  approximately  one  half  the  discriminant  variance.  Virtually  none  of  this  empirical  discriminant 
variance  difference  is  attributable  to  our  larger  number  of  trials  (as  the  critical  reader  might  suspect):  we  have 

⁶ We assume an estimated Bayes error rate of $P_e(\mathcal{F}_{\mathrm{Bayes}}) = 0\%$ for the AVN diagnosis task. The actual number is probably
non-zero since Manduca, Christy, and Ehman report that human radiology experts will not commit to a diagnosis on all 125 images in
the database [90].

⁷ All statistics attributed to Manduca, Christy, and Ehman are either published in [90] or have been provided to the author by
Dr. Manduca via personal correspondence.

looked at the empirical discriminant variance of several randomly selected 24-trial sub-sets of our 55 trials,
and it remains fairly steady about $8 \times 10^{-4}$. Thus, the differentially-generated model combines the low error
rate of the complex probabilistic model and the consistency of the simple probabilistic model.

Manduca,  Christy,  and  Ehman  ran  trials  in  which  10  humans  not  trained  in  radiology  learned  to  diagnose 
AVN  from  half  of  the  125  images;  they  were  subsequently  tested  on  the  other  half.  The  human  subjects 
had  an  average  empirical  test  sample  error  rate  (and  empirical  discriminant  bias)  of  7.5%;  their  discriminant 
variance  was  4.6  x  10“^.  Thus,  the  low  complexity  logistic  linear  classifier  employing  differential  learning 
is,  on  average,  at  least  as  good  as  the  human  novice  for  this  limited  diagnosis  task. 

9.3  Summary 

We  find  that  a  relatively  simple  logistic  linear  classifier  learns  to  diagnose  avascular  necrosis  of  the  femoral 
head from a single low-resolution MRI image with an error rate of 5.9 (+4.9/-4.2)%.* The classifier is more
consistent  than  the  best  independently  developed  (probabilistic)  model,  which  exhibits  approximately  twice 
as  much  discriminant  variance  across  independent  test  trials.  In  addition,  the  classifier’s  error  rate  compares 
favorably  with  the  7.5%  error  rate  of  human  novices  who  are  provided  with  the  same  training  and  testing 
data. 

Learning  to  diagnose  AVN  presents  an  unavoidable  tradeoff  between  discriminant  bias  and  discriminant 
variance.  The  simple  classifier  exhibits  lower  discriminant  variance  at  the  cost  of  increased  discriminant  bias 
when  complexity  is  reduced  by  compressing  the  original  high-resolution  MRI  images  into  lower-resolution 
ones;  the  details  essential  to  a  correct  diagnosis  are  simply  lost  in  the  more  difficult  examples.  This  leads  us 
to  acknowledge  that  there  are  learning  tasks  in  which  consistently  low  recognition  error  rates  demand  large 
training  samples.  In  such  tasks,  the  key  to  consistently  low  error  rates  is  in  subtle  details,  which  can  be 
gleaned  only  from  a  large  training  sample.  This  does  not  diminish  the  advantages  of  differential  learning,  but 
it  does  show  that  a  tradeoff  between  discriminant  bias  and  variance  is  sometimes  inevitable,  no  matter  what 
learning  strategy  is  employed. 


*The  upper  and  lower  bounds  on  this  error  rate  are  (more  rigorous)  confidence  bounds,  rather  than  the  standard  deviations 
quoted  in  section  9.2. 
Remote Sensing with Differential Learning¹


Outline 

We  describe  a  series  of  remote  sensing  experiments  conducted  in  collaboration  with  the  Digital  Mapping 
Laboratory,  School  of  Computer  Science,  Carnegie  Mellon  University.  We  use  a  modified  RBF  classifier 
employing  differential  learning  (DRBF)  to  interpret  multi-spectral  imagery  from  the  Daedalus  airborne 
(remote  sensing)  scanner  system.  The  interpretation  procedure  involves  classifying  individual  image  pixels, 
which  represent  64  square  meters  of  earth  surface  material,  into  eleven  categories  of  natural  and  man-made 
materials  — a  preliminary  step  in  automated  map  generation  and  various  environmental  analysis  tasks. 
The  DRBF  classifier  has  132  parameters  and  exhibits  a  29%  error  rate  on  the  interpretation  task.  The 
maximum-likelihood  (probabilistic)  model  currently  used  for  this  task  has  847  parameters  and  exhibits  a  46% 
error rate. Most of the DRBF's reduced error rate is attributable to its sub-sampling the training data during
learning; 12% of the reduction is attributable to differential learning/lower model complexity.

10.1  Introduction 

The  interpretation  of  remote  sensing  imagery  is  an  integral  part  of  a  diverse  set  of  earth  sciences  (e.g.,  map 
generation,  crop  analysis,  forestry,  land  use,  assessing  the  environmental  effects  of  airborne  and  water-borne 
pollution, etc.). The imagery is obtained from satellite and airborne optical sensors, which are sensitive to
visible  as  well  as  near  infrared  and  short-wave  infrared  light  reflected  from  the  earth’s  surface.  The  Digital 
Mapping  Laboratory,  School  of  Computer  Science,  Carnegie  Mellon  University,  is  developing  computer 
systems  for  the  automated  interpretation  of  remote  sensing  imagery  obtained  from  the  Daedalus  airborne 

¹We thank David McKeown and Stephen Ford of the Digital Mapping Laboratory for providing us with the multi-spectral imagery
for this task, introducing us to the business of remote sensing, and providing us with both the DRBF and maximum-likelihood classifier
multi-spectral imaging system (e.g., [36]). Daedalus generates images comprising eleven spectral bands (ten
span the continuum from short-wave visible light to short-wave infrared light and one is in the thermal region;
see [36, fig. 1]). The image has a ground sample distance of approximately 8 meters, so a single pixel
represents a patch of earth with an area of about 64 square meters.

As  a  preliminary  step  in  generating  maps  by  machine,  each  pixel  in  an  image  is  classified  according  to 
its  “ground  truth”  class.  Eleven  classes  are  used  in  the  current  system:  asphalt,  concrete,  coniferous  tree, 
deciduous  tree,  deep  water,  grass,  shadow,  shallow  water,  soil,  tile,  and  turbid  water.  The  pixel-by-pixel 
interpretation  is  performed  by  a  classifier  that  has  previously  learned  examples  of  each  ground  truth  class.  The 
feature  vector  X  representing  each  pixel  has  eleven  elements,  corresponding  to  the  eleven  bandpass  sensor 
outputs of the Daedalus system. Both the test imagery and training pixels are taken from a 3000 x 700 pixel
(134  square  kilometer)  image  of  the  Washington,  D.C.  metropolitan  area. 
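The area figures quoted above follow directly from the ground sample distance and the image dimensions. The short sketch below recomputes them; the function names are illustrative and are not part of the Daedalus system or the report.

```python
def pixel_area_m2(gsd_m):
    """Ground area covered by one square pixel, in square meters."""
    return gsd_m * gsd_m


def image_area_km2(width_px, height_px, gsd_m):
    """Total ground area covered by the image, in square kilometers."""
    return width_px * height_px * pixel_area_m2(gsd_m) / 1_000_000


if __name__ == "__main__":
    print(pixel_area_m2(8))               # square meters per pixel
    print(image_area_km2(3000, 700, 8))   # square kilometers for the whole image
```

With an 8 meter ground sample distance, a pixel covers 64 square meters and the 3000 x 700 pixel image covers about 134 square kilometers, in agreement with the text.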

The  remote  sensing  community  frequently  uses  Gaussian  maximum-likelihood  (ML)  classifiers  to 
interpret  multi-spectral  imagery.  In  the  case  of  the  Daedalus  scanner  data,  each  of  the  ML  model’s  eleven 
discriminant  functions  has  the  form 


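$$g_i(X \mid \theta) = \frac{f_i(X)}{\sum_{j=1}^{C} f_j(X)} \tag{10.1}$$

where

$$f_i(X) = \frac{1}{(2\pi)^{N/2}\, |\Sigma_i|^{1/2}} \exp\left[ -\tfrac{1}{2} (X - \mu_i)^{\mathsf{T}}\, \Sigma_i^{-1}\, (X - \mu_i) \right] \tag{10.2}$$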


and the $i$th mean $\mu_i$ and covariance matrix $\Sigma_i$ are subsets of the over-all discriminator parameter vector $\theta$.
Since there are $C = 11$ discriminant functions for the eleven ground truth classes,² the classifier has a total
of 11 (11 + 66) = 847 parameters³ (i.e., $\theta \in \Theta = \mathbb{R}^{847}$). The parameters of $\mu_i$ and $\Sigma_i$ are estimated
from the training sample by the method of maximum-likelihood (e.g., [28, sec. 6.5]). The resulting classifier
therefore learns probabilistically and is the fully parametric counterpart to the partially parametric logistic
linear  hypothesis  class  described  in  section  7.2.2.  In  the  jargon  of  parametric  statistical  pattern  recognition, 
the  fully-parametric  maximum  likelihood  classifier  is  the  normal-based  linear  discriminant  analysis  paradigm, 
and  its  partially-parametric  counterpart,  the  logistic  linear  classifier,  is  known  as  the  logistic  discriminant 
analysis  paradigm  (e.g.,  [91]). 

²It is purely coincidental that the number of ground truth classes $C$ and the feature vector dimensionality $N$ are both 11.

³The classifier comprises $C = 11$ discriminant functions, associated with the same number of ground truth classes. Each
discriminant function has 11 parameters that correspond to the $N = 11$-dimensional class-conditional feature vector mean. The
covariance matrix has $N^2 = 121$ parameters, but it is symmetric, so it effectively has only $N(N + 1)/2 = 66$ parameters. Therefore,
the classifier has a total of 11 (11 + 66) = 847 parameters.
We compare the maximum-likelihood classifier with a differentially-generated modified radial basis
function classifier (appendix K) having no hidden layer units. Each of the DRBF's eleven discriminant
functions has the form


$$g_i(X \mid \theta) = \exp\left[ -\frac{(X - \mu_i)^{\mathsf{T}} (X - \mu_i)}{2\sigma_i^2} \right] \tag{10.3}$$


where $\sigma_i^2$ denotes the discriminant function's single variance parameter. As a result, the DRBF classifier has
a total of 11(11 + 1) = 132 parameters (i.e., $\theta \in \Theta = \mathbb{R}^{132}$), fewer than one sixth as many as the
maximum-likelihood model.


10.2  Training  Data 

The  single  remote  sensing  training  sample  comprises  10,616  pixels  of  the  eleven  ground-truth  classes,  taken 
from  the  3000  x  700  pixel  image  of  the  Washington,  D.C.  area  described  earlier.  Although  the  four 
small test sites described in section 10.3 are taken from the same overall image, all of the training data
is from different locations in the image, so the test and training samples are disjoint. Consequently, this
is a site-dependent classification task (if the training data were taken from, say, Atlanta, Georgia, and the
test data were taken from Washington, D.C., the task would be truly site-independent). Table 10.1 shows
the number of example pixels $n_i$ for each ground truth class $\omega_i$ ($i = 1, \ldots, C$) in the training sample.
The maximum likelihood classifier learns simply by computing the maximum-likelihood estimates of the
eleven discriminant functions' class-conditional means and covariances based on this sample. By (10.1)
and (10.2), the maximum-likelihood classifier implicitly assumes that all class prior probabilities are equal.
The DRBF classifier learns differentially by maximizing the CFM objective function with respect to $\theta$
over  the  set  of  all  training  examples;  this  is  done  by  an  iterative  search  that  employs  gradient  ascent  and 
the  backpropagation  algorithm  (e.g.,  [119,  120]),  altered  for  use  with  the  modified  RBF  non-linearities  of 
(10.3).  In  order  to  speed  learning  convergence,  the  DRBF  classifier's  parameters  are  initialized  as  described 
on  page  211:  the  class-conditional  mean  vectors  are  initialized  to  their  corresponding  class-conditional 
training  sample  average,  while  the  class-conditional  variance  parameters  are  initialized  to  their  corresponding 
class-conditional  sample  variances. 
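The initialization step described above can be sketched as follows. This is only an illustration under stated assumptions: the kernel is taken to be Gaussian-shaped, exp(-||X - mu_i||^2 / (2 sigma_i^2)); the single variance per class is taken to be the per-band sample variance averaged over all bands; and all function names are hypothetical. The differential (CFM) training that follows initialization is not shown.

```python
import math


def init_drbf(samples_by_class):
    """Return {label: (mean_vector, variance)} computed from training pixels."""
    params = {}
    for label, pixels in samples_by_class.items():
        n, dim = len(pixels), len(pixels[0])
        mean = [sum(p[d] for p in pixels) / n for d in range(dim)]
        # Assumption: one variance per class, averaged over all bands.
        var = sum((p[d] - mean[d]) ** 2
                  for p in pixels for d in range(dim)) / (n * dim)
        params[label] = (mean, var)
    return params


def drbf_discriminant(x, mean, var):
    """Assumed kernel form: exp(-squared distance / (2 * variance))."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return math.exp(-sq_dist / (2.0 * var))


def classify(x, params):
    """Pick the class whose discriminant output is largest."""
    return max(params, key=lambda label: drbf_discriminant(x, *params[label]))


if __name__ == "__main__":
    # Tiny made-up two-band example (not data from the report).
    training = {
        "water": [[1.0, 1.0], [3.0, 1.0]],
        "grass": [[9.0, 9.0], [11.0, 9.0]],
    }
    model = init_drbf(training)
    print(classify([2.0, 2.0], model))
```

Starting each class mean at its sample average places every discriminant function's peak on the data it should respond to before any gradient step is taken, which is presumably why this initialization speeds convergence.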

A  differential  learning  epoch  comprises  one  iteration  of  the  learning  algorithm  for  each  example  in 
the  training  sample.  Iterative  statistical  learning  procedures  such  as  differential  learning  require  that  the 
empirical  class  prior  probabilities  of  the  training  sample  match  those  of  the  test  sample.  At  the  same 
time,  a  large  training  sample  size  for  each  ground-truth  class  is  desirable,  since  it  better  characterizes  the 
class-conditional probability density function (pdf). Because the ground-truth classes do not all have the
same  prior  probabilities  in  the  Washington,  D.C.  area,  the  DRBF  does  not  attempt  to  learn  every  training 
example  each  epoch  (see  table  10.1).  Instead  it  learns  a  fraction  of  each  ground-truth  class  in  a  given  epoch. 
Training Sample Sizes ($n_i$)

Ground Truth Class    $n_i$    DRBF's Learning Probability $p_i$
asphalt               1000
concrete              1000     1.0
coniferous tree        616     0.04
deciduous tree        1000     1.0
deep water            1000     1.0
grass                 1000     1.0
shadow                1000     1.0
shallow water         1000     0.3
soil                  1000     0.01
tile                  1000     0.01
turbid water          1000     1.0

Table 10.1: The training sample sizes $n_i$ ($i = 1, \ldots, C$) for both the maximum-likelihood and DRBF
classifiers. The far-right column applies to the DRBF classifier only. In any given epoch, the DRBF randomly
selects with probability $p_i$ each of the $n_i$ training examples of class $\omega_i$: the examples selected in a given
epoch are learned during that epoch; all other examples are ignored in that epoch. This form of randomly
sub-sampling the training data each epoch effectively alters the empirical class prior probabilities of the
training sample so that they are more representative of their true underlying values.


At the beginning of each epoch the DRBF randomly selects with probability $p_i$ (or ignores with probability
$1 - p_i$) each of the $n_i$ training examples of class $\omega_i$: the examples selected in a given epoch are learned
during that epoch; all other examples are ignored in that epoch. This form of randomly sub-sampling the
training data each epoch effectively alters the empirical class prior probabilities of the training sample so
that they are more representative of their true underlying values. That is, the probabilities $\{p_1, \ldots, p_C\}$ in
table 10.1 have been chosen so that they approximate the prior probabilities of the ground truth classes in the
Washington, D.C. area (i.e., $\{P_{\mathcal{W}}(\omega_1), \ldots, P_{\mathcal{W}}(\omega_C)\}$) thus:

$$\frac{n_i\, p_i}{\sum_{j=1}^{C} n_j\, p_j} \cong P_{\mathcal{W}}(\omega_i) \tag{10.4}$$

Reference  [35]  shows  that  this  technique  of  sub-sampling  the  training  sample  each  learning  epoch  reduces 
the  DRBF’s  empirical  test  sample  error  rate  by  about  15%.  Additionally,  it  accounts  for  a  significant  fraction 
of  the  test  differences  between  the  DRBF  and  ML  classifiers  (see  section  10.3). 
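The per-epoch sub-sampling procedure can be sketched as below. The function names are hypothetical, and the expression used for the effective priors (expected selected count per class divided by the expected total) is our reading of how the learning probabilities alter the empirical class priors, not code from the report. The three classes in the usage example take their sizes and learning probabilities from table 10.1.

```python
import random


def subsample_epoch(samples_by_class, learn_prob, rng):
    """Select each example of class c independently with probability learn_prob[c]."""
    selected = []
    for label, examples in samples_by_class.items():
        p = learn_prob[label]
        selected.extend((label, ex) for ex in examples if rng.random() < p)
    rng.shuffle(selected)  # present the selected examples in random order
    return selected


def effective_priors(sizes, learn_prob):
    """Expected empirical class priors of the sub-sampled training set."""
    expected = {c: sizes[c] * learn_prob[c] for c in sizes}
    total = sum(expected.values())
    return {c: e / total for c, e in expected.items()}


if __name__ == "__main__":
    sizes = {"concrete": 1000, "coniferous tree": 616, "soil": 1000}
    probs = {"concrete": 1.0, "coniferous tree": 0.04, "soil": 0.01}
    for label, prior in effective_priors(sizes, probs).items():
        print(f"{label}: {prior:.4f}")
```

Because a fresh random subset is drawn every epoch, all training examples of a rare class are eventually seen, even though only a small fraction of them contributes in any one epoch.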


10.3  Experimental  Results 
Figure 10.1: Top Left: Panchromatic image of the civil1 site (1.2 meter resolution). Top Right: Composite
of the multi-spectral data for the civil1 site (8 meter resolution), which the classifiers interpret.



Figures  10.1  —  10.6  pertain  to  four  sites  in  downtown  Washington,  D.C.,  on  which  we  have  tested  the  DRBF 
and maximum-likelihood classifiers. Figure 10.1 shows the vicinity of the Civil Service and Department
of the Interior buildings. The image on the left is a high-resolution panchromatic image of the site; the
image on the right is a lower resolution composite of the 11-band multi-spectral image that the classifiers
interpret.⁴ There is no explicit relationship between the colors in the multi-spectral image of figure 10.1
(right) and those in the classification maps of figure 10.2. The three images in figure 10.2 depict the ground
truth  for  the  site  (middle  —  generated  by  a  human  using  an  interactive  image  classification  tool)  and  the 
DRBF  (top)  and  maximum-likelihood  (bottom)  classifiers’  interpretations  of  the  multi-spectral  image  in 
figure  10.1.  Classification  errors  occur  at  pixels  for  which  the  color  of  a  classifier's  interpretation  differs 
from  the  ground  truth  image  color.  The  color  legend  to  the  right  in  figure  10.2  explains  the  color-scheme 
used  in  the  ground-truth  and  classification  maps. 

Tables 10.2 and 10.3 summarize the DRBF classifier's test results at the civil1 site, and tables 10.4 and
10.5 summarize the maximum-likelihood classifier's test results. Tables 10.2 and 10.4 match the ground
truth class names with their class labels ($\omega_1, \ldots, \omega_{11}$),⁵ and they show the top ten confusions made by the

⁴All of the panchromatic images were taken with a resolution of 1.2 meters per pixel side (i.e., a pixel represents 1.44 square meters
of surface area); all of the multi-spectral images were taken approximately seven years later at 8 meters/pixel side (i.e., a pixel represents
64 square meters of surface area).

⁵These label/name lists are given in both tables for quick reference.
Figure 10.2: Top: The DRBF classifier's interpretation of the civil1 site. Middle: The ground truth for the
civil1 site. Bottom: The maximum-likelihood (ML) classifier's interpretation of the civil1 site.
civil1 Ground Truth Classes

Label            Class
$\omega_1$       asphalt
$\omega_2$       concrete
$\omega_3$       deciduous tree
$\omega_4$       grass
$\omega_5$       shadow
$\omega_6$       tile
$\omega_7$       coniferous tree
$\omega_8$       deep water
$\omega_9$       shallow water
$\omega_{10}$    soil
$\omega_{11}$    turbid water

DRBF Top Ten Confusions

True Class        Misclassified as    Count    Percent
deciduous tree    grass                 800       27.3
deciduous tree    asphalt               570       19.5
concrete          asphalt               548       18.8
deciduous tree    shadow                284        9.7
asphalt           concrete              244
asphalt           shadow                180
shadow            asphalt               163       20.1
grass             deciduous tree        156       13.4
grass             asphalt               144       12.4
concrete          grass                  45        1.5

Table 10.2: Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made by
the DRBF classifier over the civil1 site.


Table 10.3: Confusion matrix for the DRBF classifier over the civil1 site.
civil1 Ground Truth Classes

Label            Class
$\omega_1$       asphalt
$\omega_2$       concrete
$\omega_3$       deciduous tree
$\omega_4$       grass
$\omega_5$       shadow
$\omega_6$       tile
$\omega_7$       coniferous tree
$\omega_8$       deep water
$\omega_9$       shallow water
$\omega_{10}$    soil
$\omega_{11}$    turbid water

ML Top Ten Confusions

True Class        Misclassified as    Count    Percent
deciduous tree    grass                 893       30.5
asphalt           tile                  876       20.3
deciduous tree    coniferous tree       831       28.4
deciduous tree    soil                  766       26.2
concrete          soil                  634       21.7
asphalt           soil                  480       11.1
grass             soil                  330       28.4
asphalt           concrete              318
concrete          tile                  283
concrete          asphalt               178        6.1

Table 10.4: Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made by
the maximum-likelihood (ML) classifier over the civil1 site.


Table 10.5: Confusion matrix for the maximum-likelihood (ML) classifier over the civil1 site.
Figure 10.3: Top Left: Panchromatic image of the gao1 site (1.2 meter resolution). Top Right: Composite
of the multi-spectral data for the gao1 site (8 meter resolution), which the classifiers interpret.


classifier.  The  percent  confused  is  equal  to  the  number  in  the  “count”  column  divided  by  the  total  number 
of  ground  truth  examples  occurring  in  the  test  sample  (image).  Tables  10.3  and  10.5  are  confusion  matrices; 
they  list  the  ground  truth  example  totals  at  the  bottom,  in  the  “total”  row.  The  class  labels  across  the  top  of 
the confusion matrices indicate the actual (or true) ground truth class, and the labels in the left-most column
denote the class detected by the classifier. The diagonal elements of the confusion matrix show the number
of ground truth examples correctly classified or detected by the classifier; the off-diagonal elements show the
number of examples misclassified (i.e., incorrectly detected as examples of another class). As an example,
the bottom entry of table 10.3 in the $\omega_1$ column indicates the percentage of asphalt pixels that the DRBF
misclassifies as some other ground truth class. The right-most column of the table's $\omega_1$ row indicates what
percentage  of  pixels  classified  as  asphalt  actually  represent  some  other  ground  truth  class. 
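The way the summary figures are read off a confusion matrix can be sketched as follows. The layout follows the description above (columns are true classes, rows are detected classes); the 3-class counts in the usage example are made up for illustration and are not taken from tables 10.3 or 10.5, and the function name is hypothetical.

```python
def confusion_summary(matrix):
    """matrix[r][c] = number of pixels of true class c detected as class r.

    Returns (false_negative_rates, false_detection_rates, overall_error_rate).
    A rate is None when its denominator is zero.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    false_negative, false_detection = [], []
    for i in range(k):
        col_total = sum(matrix[r][i] for r in range(k))  # pixels truly of class i
        row_total = sum(matrix[i])                       # pixels detected as class i
        false_negative.append(
            (col_total - matrix[i][i]) / col_total if col_total else None)
        false_detection.append(
            (row_total - matrix[i][i]) / row_total if row_total else None)
    return false_negative, false_detection, (total - correct) / total


if __name__ == "__main__":
    # Hypothetical counts for three classes.
    counts = [
        [80, 10, 0],
        [15, 85, 5],
        [5, 5, 95],
    ]
    fn, fd, err = confusion_summary(counts)
    print(fn, fd, err)
```

The false negative rate is a column statistic (how often a true class is missed), while the false detection rate is a row statistic (how often a detection is wrong); the overall error rate uses only the diagonal and the grand total.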

The two bold-face numbers in the lower right corner of tables 10.3 and 10.5 show what percentage of
the image pixels are correctly classified and what percentage are misclassified by the classifier. Table 10.3
shows that the DRBF classifier exhibits a 28.3 (+/- 0.8)% empirical error rate on the civil1 site; table 10.5
shows that the maximum-likelihood classifier exhibits a 51.5 (+/- 0.9)% empirical error rate.⁶


⁶We remind the reader that empirical test sample error rates include 95% confidence intervals: these are estimated under the
assumption that the error rate is binomially distributed [53]. Please see section 8.2 for details.
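As a rough check on the quoted intervals, the sketch below computes a 95% half-width using the normal approximation to the binomial distribution. The normal approximation and the z value of 1.96 are assumptions on our part; section 8.2 may use a different construction. The test sample size of 12,180 for the civil1 site is taken from table 10.10.

```python
import math


def binomial_half_width(error_rate, n, z=1.96):
    """Half-width of an approximate confidence interval for a binomial proportion."""
    return z * math.sqrt(error_rate * (1.0 - error_rate) / n)


if __name__ == "__main__":
    n_civil1 = 12180
    for name, rate in (("DRBF", 0.283), ("ML", 0.515)):
        half = 100.0 * binomial_half_width(rate, n_civil1)
        print(f"{name}: {100 * rate:.1f} (+/- {half:.1f})%")
```

Under these assumptions the half-widths come out at about 0.8 and 0.9 percentage points, consistent with the bounds quoted for the two classifiers at this site.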
Figure 10.4: Top: The DRBF classifier's interpretation of the gao1 site. Middle: The ground truth for the
gao1 site. Bottom: The maximum-likelihood (ML) classifier's interpretation of the gao1 site.
gao1 Ground Truth Classes

Label            Class
$\omega_1$       asphalt
$\omega_2$       concrete
$\omega_3$       deciduous tree
$\omega_4$       grass
$\omega_5$       shadow
$\omega_6$       soil
$\omega_7$       tile
$\omega_8$       coniferous tree
$\omega_9$       deep water
$\omega_{10}$    shallow water
$\omega_{11}$    turbid water

DRBF Top Ten Confusions

True Class        Misclassified as    Count    Percent
concrete          asphalt              1645       39.2
deciduous tree    asphalt               569       44.1
asphalt           concrete              386        6.1
deciduous tree    grass                 365       28.3
shadow            asphalt               349       27.8
tile              asphalt               273       77.1
asphalt           shadow                182        2.9
deciduous tree    shadow                 97
concrete          grass                  52        1.2
shadow            deep water             35        2.8

Table 10.6: Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made by
the DRBF classifier over the gao1 site.


Table 10.7: Confusion matrix for the DRBF classifier over the gao1 site.
gao1 Ground Truth Classes

Label            Class
$\omega_1$       asphalt
$\omega_2$       concrete
$\omega_3$       deciduous tree
$\omega_4$       grass
$\omega_5$       shadow
$\omega_6$       soil
$\omega_7$       tile
$\omega_8$       coniferous tree
$\omega_9$       deep water
$\omega_{10}$    shallow water
$\omega_{11}$    turbid water

ML Top Ten Confusions

True Class        Misclassified as    Count    Percent
concrete          asphalt               965       23.0
asphalt           concrete              733       11.5
asphalt           tile                  615
deciduous tree    soil                  545       42.2
shadow            asphalt               393       31.3
deciduous tree    grass                 346       26.8
concrete          soil                  298        7.1
concrete          tile                  246        5.9
asphalt           soil                  234
deciduous tree    coniferous tree       202       15.7

Table 10.8: Left: Class labels assigned to the 11 ground truth classes. Right: Top ten confusions made by
the maximum-likelihood (ML) classifier over the gao1 site.


Table 10.9: Confusion matrix for the maximum-likelihood (ML) classifier over the gao1 site.


DRBF error rate reduction: The error rate reduction realized by employing the DRBF classifier in lieu
of the maximum-likelihood (ML) classifier is $\frac{E(\nu)_{\mathrm{ML}} - E(\nu)_{\mathrm{DRBF}}}{E(\nu)_{\mathrm{ML}}}$, where $E(\nu)_{\mathrm{ML}}$ denotes the number of test
sample errors made by the maximum-likelihood classifier and $E(\nu)_{\mathrm{DRBF}}$ denotes the number made by the
DRBF classifier.


The DRBF classifier therefore reduces the maximum-likelihood classifier's empirical error rate by 45% at
the civil1 site. A review of figure 10.2 and tables 10.2 through 10.5 shows that the maximum-likelihood classifier
has high false detection rates for soil, tile, and coniferous trees. The DRBF classifier frequently fails to detect
tile. Both classifiers misclassify deciduous trees as grass roughly 30% of the time.

Figures 10.3 and 10.4 and tables 10.6 through 10.9 compare the two classifiers for the gao1 site, which
includes the General Accounting Office building. Table 10.7 shows that the DRBF classifier exhibits a
30.1 (+/- 0.8)% empirical error rate at the gao1 site; table 10.9 shows that the maximum-likelihood
classifier exhibits a 37.9 (+/- 0.8)% empirical error rate. The DRBF classifier therefore reduces the
maximum-likelihood classifier's empirical error rate by 21% at the gao1 site. A review of figure 10.4 and tables 10.6 through
10.9 shows that the classifiers exhibit the same general trends over the gao1 site as they do over the civil1
site, with a few notable exceptions. The DRBF's insensitivity to tile is more evident, owing to the higher
prior probability of that ground truth class at the gao1 site. More than half of the DRBF errors occur when it
confuses asphalt and concrete (we attribute this phenomenon to the large number of parking lots with vehicles
present at the site, the surfaces of which exhibit both asphalt and concrete-like spectral characteristics). As a
result, the disparity between the two classifiers' empirical error rates is considerably lower at the gao1 site.

Figures 10.5 and 10.6 compare the classifiers over the White House (whouse1) and the Federal Bureau
of Engraving (engrave1), respectively. Human-generated ground truth is not shown for these sites, and we
omit a detailed statistical comparison of the classifiers in the interest of brevity. A visual comparison indicates
that the general trends exhibited over the civil1 and gao1 sites are manifest at the whouse1 and engrave1
sites as well. This is confirmed by a comparison of the two classifiers' empirical error rates over these two
sites, which are shown in table 10.10. The table lists the error rates of both classifiers on all four sites, along
with 95% confidence bounds computed as described in section 8.2 and [62]. The DRBF classifier exhibits
a 28.4 (+/- 0.8)% empirical error rate at the whouse1 site; the maximum-likelihood classifier exhibits
a 51.6 (+/- 0.9)% empirical error rate. The DRBF classifier therefore reduces the maximum-likelihood
classifier's empirical error rate by 45% at the whouse1 site. The DRBF classifier exhibits a 29.6 (+/- 0.7)%
empirical error rate at the engrave1 site; the maximum-likelihood classifier exhibits a 44.7 (+/- 0.8)%
empirical error rate. The DRBF classifier therefore reduces the maximum-likelihood classifier's empirical
error rate by 34% at the engrave1 site. Based on the combined test results from all four sites, the
DRBF classifier exhibits a 29.2 (+/- 0.4)% empirical error rate; the maximum likelihood classifier exhibits
Estimated Error Rates

Site         Test Sample    Maximum Likelihood    DRBF                Percent Error Rate Reduction,
             Size $\nu$     Classifier                                DRBF versus ML
civil1       12,180         51.5 (+/- 0.9)%       28.3 (+/- 0.8)%     45%
gao1                        37.9 (+/- 0.8)%       30.1 (+/- 0.8)%     21%
whouse1      12,155         51.6 (+/- 0.9)%       28.4 (+/- 0.8)%     45%
engrave1     15,704         44.7 (+/- 0.8)%       29.6 (+/- 0.7)%     34%
all sites    54,053         46.1 (+/- 0.4)%       29.2 (+/- 0.4)%     37%

Table 10.10: A summary of the empirical test sample error rates for both the maximum-likelihood and DRBF
classifiers. The far-right column shows the percent reduction in error rate realized by employing the DRBF
in lieu of the maximum-likelihood classifier.


a 46.1 (+/- 0.4)% empirical error rate. The DRBF classifier therefore reduces the maximum-likelihood
classifier’s  empirical  error  rate  by  37%  over  all  four  sites. 
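The relative reductions quoted for each site can be recomputed from the reported error rates. The sketch below uses error rates rather than error counts, which gives the same ratio because both classifiers are scored on the same test sample; the dictionary and function names are illustrative.

```python
def error_rate_reduction(ml_rate, drbf_rate):
    """Relative reduction in error rate of the DRBF versus the ML classifier."""
    return (ml_rate - drbf_rate) / ml_rate


# (ML error rate, DRBF error rate) in percent, as reported in the text.
REPORTED_RATES = {
    "civil1": (51.5, 28.3),
    "gao1": (37.9, 30.1),
    "whouse1": (51.6, 28.4),
    "engrave1": (44.7, 29.6),
    "all sites": (46.1, 29.2),
}


def reduction_percentages(rates=REPORTED_RATES):
    """Return {site: reduction rounded to the nearest whole percent}."""
    return {site: round(100 * error_rate_reduction(ml, drbf))
            for site, (ml, drbf) in rates.items()}


if __name__ == "__main__":
    for site, pct in reduction_percentages().items():
        print(f"{site}: {pct}%")
```

Note that these are relative reductions: at the first site the DRBF removes 23.2 percentage points of error, which is 45% of the maximum-likelihood classifier's 51.5% error rate.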

10.3.1 Interpretation of Test Results

The  reduced  complexity  and  differential  learning  strategy  of  the  DRBF  classifier  account  for  part  of  the 
37%  improvement  over  the  maximum-likelihood  classifier,  but  they  do  not  account  for  all  of  it.  Recall 
from  section  10.2  that  the  DRBF  classifier  sub-samples  the  training  data  when  learning  — a  procedure 
that  effectively  alters  the  ground  truth  class  prior  probabilities  and  reduces  the  DRBF’s  error  rate  by  15%. 
For  this  reason,  we  suspect  that  more  than  half  of  the  DRBF’s  improvement  over  the  maximum-likelihood 
classifier  has  nothing  to  do  with  differential  learning  and  the  reduced  classifier  complexity  it  allows.  Indeed, 
if the DRBF classifier learns without sub-sampling the training data, its empirical error rate over the civil1
and gao1 sites increases to 40.4 (+/- 0.6)%.⁷ Thus, the DRBF classifier without sub-sampled training data
reduces the maximum-likelihood classifier's empirical error rate by only 12%, a statistically significant
but not substantial amount. Put another way, it is reasonable to believe that incorporating estimates of
the ground-truth class prior probabilities into the maximum-likelihood classifier via (10.1), altering that
equation so that it is an expression of Bayes' rule, would reduce the maximum-likelihood classifier's error
rate by about 25%, leaving only a 12% advantage to the DRBF classifier.
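The alteration suggested above, weighting each class-conditional likelihood by an estimate of its class prior so that the discriminant becomes an expression of Bayes' rule, can be sketched as follows. The function names are hypothetical and the likelihood and prior values in the usage example are made up purely to show how a prior can change the decision; they are not estimates from the Washington, D.C. data.

```python
def posteriors(likelihoods, priors=None):
    """Normalized discriminant outputs P_i * f_i / sum_j P_j * f_j.

    With priors=None every class receives the same prior, which corresponds
    to the equal-prior assumption stated for the maximum-likelihood classifier.
    """
    if priors is None:
        priors = [1.0 / len(likelihoods)] * len(likelihoods)
    weighted = [p * f for p, f in zip(priors, likelihoods)]
    total = sum(weighted)
    return [w / total for w in weighted]


def decide(likelihoods, priors=None):
    """Index of the class with the largest discriminant output."""
    post = posteriors(likelihoods, priors)
    return max(range(len(post)), key=post.__getitem__)


if __name__ == "__main__":
    f = [0.2, 0.3]                           # illustrative class-conditional likelihoods
    print(decide(f))                         # equal priors: class 1 wins
    print(decide(f, priors=[0.9, 0.1]))      # unequal priors: class 0 wins
```

A rare class with a slightly higher likelihood wins under equal priors but loses once its low prior is taken into account, which is the mechanism by which prior estimates would be expected to lower the maximum-likelihood classifier's error rate.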

These differences notwithstanding, we suspect that the DRBF's 29% error rate is close to the (minimum)
Bayes error rate for $X$, an 11-element vector describing a single image pixel. The human expert generates
ground truth for each site by looking at the over-all image; thus, (s)he exploits contextual information (e.g.,
the appearance of neighboring pixels) to which the DRBF does not have access. Given such information


⁷Comparable results for the whouse1 and engrave1 sites have not been compiled.
Figure 10.5: Top Left: Panchromatic image of the whouse1 site (1.2 meter resolution). Top Right:
Composite of the multi-spectral data for the whouse1 site (8 meter resolution), which the classifiers
interpret. Bottom Left: The maximum-likelihood classifier's interpretation of the whouse1 site. Bottom
Right: The DRBF classifier's interpretation of the whouse1 site.
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Classification  Legend 


Figure 10.6: Top Left: Panchromatic image of the engravel site (1.2 meter resolution). Top Right:
Composite of the multi-spectral data for the engravel site (8 meter resolution), which the classifiers
interpret. Bottom Left: The maximum-likelihood classifier's interpretation of the engravel site. Bottom
Right: The DRBF classifier's interpretation of the engravel site.

in the form of a feature vector that includes the pixel of interest and its surrounding neighbors, the DRBF
classifier would probably exhibit a significantly lower error rate. Owing to the geometric increase in the
number  of  parameters  a  comparable  maximum-likelihood  model  would  require  for  the  expanded  feature 
vector,*  we  doubt  such  a  model  would  exhibit  a  lower  error  rate. 
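The parameter growth behind this doubt can be made concrete under assumed model structures. The sketch below assumes a maximum-likelihood classifier built from one full-covariance Gaussian per class and a classifier whose parameter count is linear in the feature dimension (N + 1 parameters per class). Both structures, the class count of 11, and the 3 x 3 neighborhood are assumptions made for illustration and are not stated in this section.

```python
def gaussian_ml_param_count(num_features, num_classes):
    """Parameters for one full-covariance Gaussian per class (assumed structure)."""
    mean_params = num_features
    covariance_params = num_features * (num_features + 1) // 2  # symmetric matrix
    return num_classes * (mean_params + covariance_params)

def linear_param_count(num_features, num_classes):
    """Parameters for a model with N + 1 parameters per class (assumed structure)."""
    return num_classes * (num_features + 1)

# Single-pixel feature vector (N = 11) with an assumed 11 classes.
print(gaussian_ml_param_count(11, 11), linear_param_count(11, 11))

# Hypothetical 3 x 3 pixel neighborhood: N = 9 * 11 = 99 features.
print(gaussian_ml_param_count(99, 11), linear_param_count(99, 11))
```

Under these assumptions the single-pixel counts come out to 847 and 132, the same figures quoted for the two classifiers in this chapter, while the neighborhood feature vector pushes the Gaussian count into the tens of thousands and the linear count to 1100.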

10.4  Summary 

We have used a DRBF classifier with 132 parameters to interpret multi-spectral images of the Washington,
D.C. area, pixel-by-pixel. The DRBF exhibits a 29% error rate, 37% lower than the 46% error rate
exhibited  by  the  847-parameter  maximum-likelihood  (probabilistic)  classifier  currently  used  for  this  task. 
The  DRBF  classifier's  reduced  complexity  and  differential  learning  strategy  account  for  approximately  12% 
of  the  improvement  over  the  maximum-likelihood  model;  the  remaining  25%  of  the  improvement  is  due  to 
the  DRBF  classifier's  method  of  sub-sampling  the  training  data  during  learning.  This  sub-sampling  allows 
the  DRBF  classifier  to  adjust  the  training  sample’s  empirical  class  prior  probabilities  so  that  they  match  those 
of  the  test  sample,  thereby  ensuring  that  the  training  sample  is  representative  of  the  test  sample. 
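The percentages in this summary are relative reductions measured against the maximum-likelihood error rate. The short check below reproduces them from the error rates quoted in this chapter, taking those rates at face value; the function and variable names are illustrative.

```python
def relative_reduction(baseline_rate, new_rate):
    """Fractional reduction of new_rate relative to baseline_rate."""
    return (baseline_rate - new_rate) / baseline_rate

ML_RATE = 46.0               # maximum-likelihood classifier, percent
DRBF_FULL_RATE = 40.4        # DRBF trained without sub-sampling, percent
DRBF_SUBSAMPLED_RATE = 29.0  # DRBF trained with sub-sampling, percent

total = relative_reduction(ML_RATE, DRBF_SUBSAMPLED_RATE)
from_learning = relative_reduction(ML_RATE, DRBF_FULL_RATE)
from_subsampling = total - from_learning

print(f"total {total:.0%}, learning strategy {from_learning:.0%}, "
      f"sub-sampling {from_subsampling:.0%}")
```

The three values round to 37%, 12%, and 25%, so the 12% and 25% figures are shares of the overall 37% reduction rather than fractions of it.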


*The  number  of  DRBF  classifier  parameters  increases  linearly  with  N,  the  feature  vector  dimensionality.  The  number  of 
maximum-likelihood  classifier  parameters  increases  as  ,  dooming  the  paradigm  to  Bellman's  curse  of  dimensionality  [13]. 


Chapter  11 

Conclusions 


11.1  Scientific  Contributions 

We began this text by stating that the research herein is motivated by three convictions: the first is that it
is  not  necessary  to  estimate  probabilities  in  order  to  perform  robust  statistical  pattern  recognition;  the  second 
is  that  there  are  many  real-world  pattern  recognition  tasks  for  which  a  proper  parametric  model  either  does 
not  exist  or  cannot  be  determined;  the  third  is  that  the  simplest  model  of  the  data  is  generally  the  best  one  — 
Occam’s  razor.  These  convictions  motivated  our  research  and  serve  as  the  backdrop  to  what  we  believe  is 
our  principal  contribution  to  the  fields  of  machine  learning  and  statistical  pattern  recognition: 

differential learning: a discriminative learning strategy for differentiable supervised classifiers
that guarantees the best-generalizing classifier allowed by the choice of hypothesis class,
whatever  that  choice  is;  the  guarantee  always  holds  for  large  training  sample  sizes  and  it  usually 
holds  (i.e.,  it  holds  for  all  improper  parametric  models)  when  the  training  sample  size  is  small. 

Lesser  contributions  —  Chapter-by-chapter,  the  lesser  contributions  are  as  follows: 

•  Defining  two  strategies  by  which  differentiable  supervised  classifiers  can  learn:  probabilistic  and 
differential  (chapter  2). 

•  Defining  two  fundamental  forms  of  the  Bayesian  discriminant  function  that  correspond  to  the 
probabilistic  and  differential  learning  strategies  (chapter  2). 

•  Developing  an  estimation-theoretic  view  of  generalization  (chapter  3): 

-  Defining  the  classifier  as  an  estimator  of  the  Bayes-optimal  classifier. 

-  Defining  estimation-theoretic  measures  of  generalization: 


*  Discriminant  bias 

*  Discriminant variance

*  Mean-squared discriminant error (MSDE)

-  Defining the efficient and relatively efficient classifiers.

-  Defining the efficient and asymptotically efficient learning strategies.

•  Proving that differential learning is accomplished by maximizing the CFM objective function
(chapter 2).

•  Proving  that  differential  learning  is  asymptotically  efficient  (chapter  3). 

•  Proving  that  differential  learning  requires  the  minimum-complexity  hypothesis  class  necessary  for 
Bayesian  discrimination  (chapters  3  and  6). 

•  Proving  that  minimizing  a  classifier’s  functional  error  does  not  equate  to  minimizing  its  error  rate;  that 
is,  error  measures  are  non-monotonic  (chapters  3  and  5). 

•  Proving  that  probabilistic  learning  is  accomplished  by  minimizing  error  measure  objective  functions 
(chapter  2).' 

•  Proving  that  probabilistic  learning  is  inefficient  (chapter  3). 

•  Defining  proper  and  improper  parametric  models  (chapter  3). 

•  Sketching and illustrating the proof that probabilistically-generated proper parametric models can be
more  efficient  than  their  differentially-generated  counterparts  for  small  training  sample  sizes  (chapters 
3  and  4). 

•  Developing  the  classification  figure-of-merit  (CFM)  objective  function  [55]  and  deriving  a  synthetic 
form  of  it  that  engenders  reasonably  fast  learning  (chapter  5,  and  appendix  D). 

•  Developing distribution-dependent bounds on the training sample size requirements for good
generalization via differential and probabilistic learning ([51] and chapter 6).

•  Showing  that  the  minimum-complexity  requirements  of  differential  learning  are  consistent  with  the 
tenets  of  VC  theory  (section  3.5). 
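The contribution above stating that minimizing functional error does not equate to minimizing error rate can be illustrated with a toy case. The sketch below uses mean-squared error as the error measure and a 0.5 decision threshold; the output values are invented for illustration, and this is not the proof given in chapters 3 and 5.

```python
def mean_squared_error(outputs, targets):
    """Average squared difference between outputs and targets."""
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)

def error_rate(outputs, targets, threshold=0.5):
    """Fraction of examples whose thresholded output disagrees with the target."""
    wrong = sum((o >= threshold) != (t >= threshold) for o, t in zip(outputs, targets))
    return wrong / len(targets)

targets = [1.0, 1.0, 1.0, 1.0]
timid = [0.49, 0.49, 0.49, 0.49]  # close to the target everywhere, always misclassifies
bold = [0.51, 0.51, 0.51, 0.0]    # one large miss, otherwise correct

for name, outputs in (("timid", timid), ("bold", bold)):
    print(name, mean_squared_error(outputs, targets), error_rate(outputs, targets))
```

The "timid" outputs achieve the lower mean-squared error yet misclassify every example, while the "bold" outputs have the higher mean-squared error and the lower error rate.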

11.2 Philosophical Implications of Differential Learning

There  are  at  least  six  “philosophical”  implications  of  differential  learning  that  warrant  discussion;  the  most 
strenuous  objections  to  differential  learning  that  we  have  fielded  to  date  pertain  to  them. 

' We remind the reader that our claim of originality here is restricted to our definition of and proofs relating to the general error
measure, a significant fraction of which is due to Barak Pearlmutter. Proofs pertaining to specific error measures precede our work, and

Estimating probabilities — Differential learning seeks only to learn the identity of the most likely class
of  the  feature  vector  over  its  domain  —  a  discriminative  strategy  that  equates  to  learning  the  Bayes-optimal 
class  boundaries  on  the  feature  vector’s  domain.  This  learning  objective  is  substantially  less  rigorous  than 
that  of  probabilistic  learning,  which  seeks  to  learn  all  the  a  posteriori  class  probabilities  of  the  feature  vector 
over  its  domain.  Indeed,  the  lower  degree  of  rigor  accounts  in  part  for  the  efficiency  of  differential  learning. 
By  accepting  differential  learning  we  abandon  the  goal  of  estimating  probabilities.  This  is  heresy  to  some 
traditionalists.  In  fairness  to  them,  we  acknowledge  that  there  are  statistical  pattern  recognition  tasks  for 
which  probabilistic  estimates  are  essential.  If,  for  example,  we  are  going  to  caution  a  potential  coronary 
bypass  surgery  candidate  against  an  operation  because  our  computer  diagnostic  system  indicates  that  she  will 
not  survive  the  procedure,  then  we  might  require  a  firm  probabilistic  estimate  of  mortality  (both  with  and 
without  surgery)  on  which  to  base  the  ultimate  decision  to  operate  or  not.  Hidden  Markov  Models  (HMMs) 
and  Markov  Random  Fields,  which  are  used  to  recognize  patterns  that  evolve  over  space  and/or  time,  rely 
heavily  on  robust  probabilistic  estimates.  Since  differential  learning  does  not  generate  these,  it  is  not  likely 
that  it  can  work  well  in  hybrid  systems  that  use  neural  network  classifiers  to  estimate  the  probabilities  for 
HMM  systems  (e.g.,  [126]). 
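The point that classification depends only on which discriminator output is largest, not on the outputs being probabilities, can be sketched as follows. The definition of the discriminant differential used here (the output for a given class minus the largest other output) is an assumption consistent with the glossary description, and the numeric values are invented.

```python
def classify(outputs):
    """Index of the largest discriminator output."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

def discriminant_differential(outputs, class_index):
    """Output for class_index minus the largest other output (assumed definition)."""
    largest_other = max(o for i, o in enumerate(outputs) if i != class_index)
    return outputs[class_index] - largest_other

# Hypothetical a posteriori class probabilities for one feature vector.
posteriors = [0.55, 0.30, 0.15]
# Hypothetical discriminator outputs that are poor probability estimates
# (they do not even sum to one) but rank the same class first.
discriminator_outputs = [0.90, 0.05, 0.40]

print(classify(posteriors), classify(discriminator_outputs))
print(discriminant_differential(discriminator_outputs, 0))
```

Both sets of values yield the same classification, and the differential for the winning class is positive; only the sign of that differential matters for the decision.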

If  we  need  a  probabilistic  estimate,  we  are  compelled  to  allocate  the  resources  necessary  for  a  robust 
estimate.  On  the  basis  of  section  6.4,  this  might  require  extremely  large  training  sample  sizes  (depending  on 
whether  or  not  our  parametric  model  is  a  good  approximation  to  the  proper  one).  If  the  model  is  approximately 
proper  and/or  the  data  can  be  collected,  then  we  need  only  expend  the  time,  money,  and  effort  necessary 
for  the  collection.  If,  on  the  other  hand,  the  model  is  improper  and  there  is  no  plausible  way  to  obtain  the 
data,  then  we  must  face  the  fact  that  reliable  probabilistic  estimates  simply  cannot  be  made.  Under  these 
constraints,  classification  will  be  more  reliable  if  we  abandon  the  untenable  goal  of  estimating  probabilities 
and  employ  differential  learning. 

Ultimately,  we  should  consider  whether  or  not  probabilistic  estimates  are  essential  to  our  objectives  in  the 
context  of  whether  or  not  our  parametric  model  is  approximately  proper.  The  decision  science  literature  is  full 
of  studies  showing  that  humans  —  indeed  human  experts  —  are  remarkably  bad  at  estimating  probabilities 
and  applying  them  consistently  to  their  process  of  decision  making  (see  for  example  [69,  27]).  Nevertheless, 
we  revere  our  human  pattern  recognition  capabilities  and  have  great  confidence  in  their  ability  to  guide  us  to 
rational  decisions.  This  paradox  might  be  explained  by  the  differences  between  probabilistic  and  differential 
learning,  although  we  make  no  claim  of  biological  plausibility.  We  leave  it  to  the  reader  to  decide  when 
probabilistic  learning  is  imperative;  absent  this  imperative  and/or  the  knowledge  of  a  proper  parametric 
model  of  the  data,  we  suggest  the  differential  learning  strategy. 


How  Easy  is  it  to  Approximate  the  Proper  Parametric  Model  —  We  have  sketched  the  proof  that 
probabilistic  learning  generates  the  efficient  classifier  when  the  hypothesis  class  is  a  proper  parametric  model 
of the data (section 3.6); the experiments of sections 4.2 and 8.5.4 confirm the phenomenon. In all other
cases  —  that  is,  when  the  hypothesis  class  is  an  improper  parametric  model  of  the  data  —  differential 
learning  generates  the  relatively  efficient  classifier  for  both  small  and  large  training  sample  sizes.  Thus, 
the  choice  between  probabilistic  and  differential  learning  hinges  on  whether  or  not  the  hypothesis  class  is 
a proper parametric model of the feature vector. We remind the reader that if the proper parametric model
exists  —  indeed  it  does  not  always  exist  —  it  is  unique.  In  practice,  however,  we  need  only  approximate 
it  in  order  to  obtain  efficient  probabilistic  learning.  The  wide  acceptance  and  use  of  logistic  regression 
models  and  multi-layer  perceptrons  follows  from  their  use  of  logistic  non-linearities.  Many  feature  vectors 
have  well-separated,  unimodal  class-conditional  densities;  consequently,  the  logistic  function  allows  a  good 
approximation  to  the  feature  vector’s  a  posteriori  class  probabilities  —  the  model  is  approximately  proper. 
Thus, the real question that should decide between probabilistic and differential learning is whether this
reasonable  approximation  holds  in  a  given  case.  We  encourage  the  reader  to  ask  this  question  by  rigorous 
hypothesis  testing,  and  act  on  the  answer  as  appropriate. 
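One standard case in which a logistic model is exactly proper is a two-class problem with Gaussian class-conditional densities of equal variance: the a posteriori probability is then a logistic function of the feature. The check below verifies this numerically; the means, variance, and prior are arbitrary illustrative values, and the example is offered as a familiar special case rather than as the argument of section 3.6.

```python
import math

def gaussian_pdf(x, mean, std):
    """1-D Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * std ** 2)) / (std * math.sqrt(2.0 * math.pi))

def posterior_by_bayes(x, mean1, mean2, std, prior1):
    """P(class 1 | x) computed directly from Bayes' rule."""
    numerator = prior1 * gaussian_pdf(x, mean1, std)
    denominator = numerator + (1.0 - prior1) * gaussian_pdf(x, mean2, std)
    return numerator / denominator

def posterior_by_logistic(x, mean1, mean2, std, prior1):
    """P(class 1 | x) written as a logistic function of slope * x + offset."""
    slope = (mean1 - mean2) / std ** 2
    offset = (mean2 ** 2 - mean1 ** 2) / (2.0 * std ** 2) + math.log(prior1 / (1.0 - prior1))
    return 1.0 / (1.0 + math.exp(-(slope * x + offset)))

for x in (-2.0, 0.0, 0.7, 1.0, 3.5):
    bayes = posterior_by_bayes(x, 0.0, 2.0, 1.0, 0.3)
    logistic = posterior_by_logistic(x, 0.0, 2.0, 1.0, 0.3)
    print(x, bayes, logistic)
```

The two columns agree to floating-point precision. When the class-conditional densities depart from this form (unequal variances, multiple modes), the logistic form is only an approximation, which is the situation the hypothesis test suggested above is meant to detect.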

The bias/variance tradeoff — As discussed in chapter 3, there is a difference between a classifier's
functional bias and variance and its discriminant bias and variance. In the context of proper parametric
models  a  third  type  of  parametric  bias  and  variance  arises,  which  is  closely  related  to  its  functional 
counterpart.  Assuming  that  the  model  is  believed  to  be  proper  (regardless  of  whether  or  not  it  really  is), 
differential  learning  trades  an  increase  in  the  classifier’s  parametric  and  functional  bias  for  a  decrease  in  both 
its  discriminant  bias  and  variance.  Whether  or  not  the  trade  is  a  good  one  from  a  classification  perspective 
depends  upon  whether  or  not  the  parametric  model  is  indeed  proper.  If  it  is,  the  trade  is  not  a  good  one  for 
small  training  sample  sizes  because  probabilistic  learning  can  probably  generate  a  more  efficient  classifier 
with  lower  parametric  and  functional  bias/variance  as  well.  If  it  is  not,  the  trade  is  a  good  one  for  both 
small and large training sample sizes because differential learning generates the relatively efficient classifier,
whereas  probabilistic  learning  generates  a  distinctly  inefficient  classifier. 

Interpreting models — One of the many criticisms leveled against neural network classifiers is that,
owing to their complexity, they do not reveal readily discernible relationships between the feature vector and
the Bayes-optimal classification. Indeed, part of the appeal of parametric models (here we use the term in the
traditional sense) is their simple structure, which lends itself to straightforward interpretation. Our experience
is that physicians, for example, prefer a bad model that is readily interpretable to a good one that is not. From
an  engineering  perspective  this  seems  silly,  but  from  a  medical  perspective  —  people’s  lives  may  depend  on 
the  classifier’s  discrimination  —  the  decision  favors  certainty  over  uncertainty;  this  is  a  rational,  defensible 
preference.  By  implicitly  assuming  that  the  parametric  model  of  the  data  is  improper,  differential  learning 
adds  one  more  layer  of  uncertainty  in  the  eyes  of  some  potential  users.  We  must  therefore  develop  better 
theories and tools for understanding these models (see section 11.3) if we are to eliminate the uncertainty and
exploit the superior generalization afforded by differential learning.

Representation — Finally, the renaissance of connectionism has brought the issue of representation to
the fore. In our terminology, the choice of hypothesis class is the choice of representation. This choice and
how it is made are topics of active debate. There seems to be strong consensus that a good representation of
the  data,  carefully  engineered  prior  to  learning,  is  the  best  assurance  of  good  generalization.  As  we  state  in 
chapter  8,  we  dispute  this  conclusion  and  offer  both  our  theoretical  and  experimental  results  as  evidence  that 
representation  is  not  such  an  important  issue.^  Complexity  issues  are  undeniably  important,  as  Vapnik  has 
clearly  proven,  but  the  specific  functional  basis  with  which  the  data  is  modeled  is  secondary  when  differential 
learning  is  employed.  As  long  as  the  hypothesis  class  has  sufficient  functional  complexity  to  approximate  the 
Bayes-optimal  class  boundaries  on  feature  vector  space,  the  representation  is  adequate.  This  is  obviously  not 
the  case  with  probabilistic  learning,  since  it  seeks  to  model  the  feature  vector’s  a  posteriori  class  probabilities 
in  addition  to  its  class  boundaries  —  a  function  approximation  task  for  which  representation  is  a  key  issue. 

Weighting  Risks  —  Not  all  classification  tasks  weight  classifications  equally.  The  magnetic  resonance 
image  (MRI)  interpretation  task  of  chapter  9  is  a  good  example.  The  risk  of  failing  to  detect  avascular 
necrosis  (AVN)  should  be  weighted  more  heavily  than  the  risk  of  a  “false  positive”.  Different  weightings  are 
incorporated  into  probabilistic  models  by  a  simple  application  of  Bayes  rule  after  the  classifier  has  learned  the 
training sample. Although this same procedure can be used with differentially generated classifiers, it is not
theoretically  defensible,  since  the  classifier’s  outputs  do  not  represent  probabilistic  estimates.  Altering  the 
empirical  class  prior  probabilities  of  the  training  sample  to  account  for  the  Bayesian  risk  formalism  is  a  more 
defensible  approach  with  differential  learning;  equivalently,  the  step  size  of  the  iterative  search  algorithm 
used  for  differential  learning  can  be  weighted  in  proportion  to  the  risk  associated  with  each  class.  Both  of 
these  techniques  can  be  shown  to  implement  the  Bayesian  risk  weighting  formalism. 
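The bookkeeping behind the two techniques can be sketched as follows. The class labels, sample counts, risk values, and base step size are hypothetical, and the code only computes the altered priors and per-class step sizes; it does not demonstrate the equivalence to the Bayesian risk formalism claimed above.

```python
from collections import Counter

def risk_adjusted_priors(labels, risks):
    """Empirical class priors re-weighted in proportion to each class's risk."""
    counts = Counter(labels)
    weighted = {label: counts[label] * risks[label] for label in counts}
    total = sum(weighted.values())
    return {label: weight / total for label, weight in weighted.items()}

def risk_weighted_step_sizes(base_step, risks):
    """Per-class step sizes proportional to the risk associated with each class."""
    return {label: base_step * risk for label, risk in risks.items()}

# Hypothetical training sample: 10 positive and 90 negative examples,
# with a missed positive judged five times as costly as a false positive.
labels = ["avn"] * 10 + ["normal"] * 90
risks = {"avn": 5.0, "normal": 1.0}

print(risk_adjusted_priors(labels, risks))
print(risk_weighted_step_sizes(0.01, risks))
```

A training sample re-sampled to the adjusted priors presents the high-risk class more often; scaling the step size instead leaves the sample unchanged and enlarges each update made on a high-risk example.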

11.3 Future Research

We  view  the  learning  process  as  one  in  which  the  learner  must 

•  choose  a  strategy  for  learning  the  training  data  with  a  model, 

•  choose  a  means  of  implementing  the  learning  strategy  (i.e.  a  specific  algorithm  that  implements  the 
strategy), 

•  acquire  the  training  data  and  choose  the  manner  in  which  that  data  is  represented, 

^ We do not make this statement without regard to the effect of representation on the learning rate. The choice of representation can
mean the difference between reasonably fast learning and unreasonably slow learning, so it is an important choice. Indeed, the end of
section .S.^.l implies that differential learning frees us to choose models that yield Bayesian discrimination and learn reasonably fast
over those that might be better probabilistic representations of the data but take unreasonably long to learn.

Figure 11.1: A simplified view of efficient autonomous learning.


•  choose  a  hypothesis  class  (i.e.,  a  limited  set  of  choices  from  which  the  classifier  will  be  generated), 

•  allocate  the  storage  and  computational  resources  necessary  for  the  data,  the  hypothesis  class,  and  the 
learning  strategy  as  implemented, 

•  learn,  and 

•  assess  the  resulting  model  in  terms  of  alternative  plausible  models. 

Figure 11.1 illustrates this view of learning. Our research to date has focused on proving that differential
learning  is  an  optimal  strategy,  in  that  it  guarantees  the  best  generalization  allowed  by  the  choice  of  hypothesis 
class.  We  have  also  developed  a  computationally  efficient  implementation  of  differential  learning.  Given 
Occam’s  razor,  the  efficiency  and  minimum-complexity  requirements  of  differential  learning  are  significant: 
the  simplest  model  that  explains  the  data  will  generalize  best,  given  a  finite  number  of  training  examples,  and 
this model can in principle be generated with differential learning.

Since Kolmogorov's theorem [77] can be interpreted as a proof that the minimum-complexity
Bayes-optimal classifier can be determined only by exhaustive search, we are faced with the challenge of
differentially  learning  a  reasonable  approximation  to  that  classifier  and  comparing  it  with  other  plausible 
models.  Since  the  classifier’s  complexity  is  directly  related  to  the  training  data’s  form  (i.e.,  the  specific 
form  of  the  feature  vector),  the  challenge  of  finding  a  reasonable  approximation  to  the  minimum-complexity 
classifier  involves  choosing  a  form  for  the  data  and  generating  a  model  that  explains  the  data.  In  turn,  the 
process of choosing a data form and generating a model can be viewed as an iterative search on the joint
space of all possible data forms and models. In our research to date, we have chosen the data form and the
hypothesis class (i.e., the set of allowed models) by a procedure that requires substantial human oversight.
The  training  data  form  has  been  fixed  prior  to  learning.  Likewise,  hypothesis  class  selection  has  been  done 
by  humans  prior  to  learning  and  has  remained  fixed  during  learning.  Finally,  learning  rates,  CFM  confidence 
parameter reduction schedules, and regularization (e.g., weight decay and weight smoothing factors) have
been  fixed  by  humans  prior  to  learning.  If  the  learning  machine  is  to  be  truly  autonomous,  all  of  these 
choices  must  be  controlled  by  the  machine  during  learning.  Clearly  then,  future  work  should  entail  theory 
and  procedures  by  which  the  learning  machine  can  continuously  manipulate  the  training  data,  the  hypothesis 
class, and the learning search procedure in a manner that both exploits and is consistent with the efficient,
minimum-complexity  nature  of  differential  learning. 
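As a rough picture of a search over the joint space of data forms and models, the sketch below scores every combination in a small grid and keeps the least complex one whose validation error is within a tolerance of the best. It is only an exhaustive toy search with invented error values; the data form names, hypothesis class names, complexity scores, and the `evaluate` callback are all hypothetical, and this is not the autonomous procedure the text calls for.

```python
def select_simplest_adequate(data_forms, hypothesis_classes, evaluate, tolerance=0.01):
    """Pick the least complex (data form, hypothesis class) pair that is near-best.

    hypothesis_classes is a list of (name, complexity) pairs and
    evaluate(form, name) returns a validation error rate.
    """
    results = []
    for form in data_forms:
        for name, complexity in hypothesis_classes:
            results.append((evaluate(form, name), complexity, form, name))
    best_error = min(result[0] for result in results)
    adequate = [result for result in results if result[0] <= best_error + tolerance]
    # Prefer lower complexity first, then lower error.
    error, complexity, form, name = min(adequate, key=lambda result: (result[1], result[0]))
    return {"data_form": form, "hypothesis_class": name,
            "complexity": complexity, "error": error}

# Invented validation errors standing in for trained-model assessments.
TOY_ERRORS = {
    ("single_pixel", "linear"): 0.40,
    ("single_pixel", "rbf"): 0.31,
    ("pixel_neighborhood", "linear"): 0.295,
    ("pixel_neighborhood", "rbf"): 0.29,
}

choice = select_simplest_adequate(
    ["single_pixel", "pixel_neighborhood"],
    [("linear", 1), ("rbf", 2)],
    lambda form, name: TOY_ERRORS[(form, name)],
)
print(choice)
```

With these invented numbers the search returns the lower-complexity model on the richer data form, since its error is within the tolerance of the best error found.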

The critical reader will note that numerous researchers are exploring both theories and algorithms for
automatically regulating model complexity during learning. The theoretical work of MacKay (e.g., [88]) and
the cascade correlation [32] and optimal brain damage (OBD) [85] algorithms are well-known works in the
connectionist literature. Since all of these works derive from an inefficient probabilistic view of learning,
they can all be shown to be inefficient paradigms for model complexity regulation. Nevertheless, they can be
adapted  to  differential  learning  (a  process  that  we  have  already  begun  with  encouraging  results).  We  therefore 
believe  that  differential  variants  of  these  techniques  hold  promise  for  autonomous  differential  learning. 

Finally,  a  statistically  rigorous  method  of  testing/rejecting  classification  hypotheses  made  by  the 
differentially-generated  classifier  after  learning  needs  to  be  developed.  Such  a  testing  procedure  could  form 
the basis of more sophisticated hypothesis testing procedures for model interpretation (recall section 11.2) as
well  as  classification  assessment. 
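A simple, non-rigorous starting point is the reject threshold described in the glossary: a classification is rejected when its discriminant differential falls below a fixed value. The sketch below assumes the differential of a test example is the largest discriminator output minus the second largest; the threshold and output values are invented, and this is not the statistically rigorous test that remains to be developed.

```python
def classify_with_reject(outputs, reject_threshold):
    """Return (class_index or None, differential) for one test example.

    The differential is the largest output minus the runner-up (assumed
    definition); classifications with a differential below the threshold
    are rejected by returning None as the class.
    """
    ranked = sorted(range(len(outputs)), key=lambda i: outputs[i], reverse=True)
    winner, runner_up = ranked[0], ranked[1]
    differential = outputs[winner] - outputs[runner_up]
    if differential < reject_threshold:
        return None, differential
    return winner, differential

print(classify_with_reject([0.90, 0.10, 0.20], reject_threshold=0.3))
print(classify_with_reject([0.50, 0.45, 0.10], reject_threshold=0.3))
```

The first example is accepted because the winning output is well clear of the runner-up; the second is rejected because the two largest outputs are nearly tied.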


Appendix  A 

Glossary  of  Notation 


We employ a mixture of the general notational conventions of [45, 29, 117, 100]. The list below is a
comprehensive record of the notation used in the text; common symbols are omitted.


Symbol    Meaning

≜    Read, "... is defined as ..."
∴    Read, "Therefore."
∉    Read, "... is not in ..."
∄    Read, "There does not exist ..."
≳    Read, "... approximately greater than or equal to ..."
≲    Read, "... approximately less than or equal to ..."
∇_Z (f(Z))    The gradient of f(Z) with respect to the vector Z.
0    The zero vector (the number of elements in the vector is context-dependent).
| |    The cardinality of a set; the absolute value of a number; the determinant of a matrix.
‖ ‖    The magnitude of a vector.
ARE_{n→∞}    Asymptotic relative efficiency (see definition 3.18).
    The boundary on X between classes ω_i and ω_j.
C    The number of classes (i.e., concepts) in a pattern recognition task.
CE    The Kullback-Leibler information distance [82, 81], also known as the "cross entropy" objective function.
CE(S^n | θ)    The CE generated by the training sample S^n, given the discriminator 𝒢(X | θ) with parameterization θ.
Cf    The correct fraction of discriminator output space (see definition 5.12).
    The monotonic correct fraction of discriminator output space (see definition 5.15).
    The monotonic correct fraction of discriminator output space associated with a given objective function, given a C-dimensional discriminator output space (i.e., a C-class learning task).


CFM 

CFM  (5"|e) 

CFM(5"|»[itl) 

dimvc() 

S 


Sr 


^reject 


Sleamed 

<5,(X|0*) 


D 


-.D 


P(X) 

I>(X)fl„.„ 

P*(X) 

V*{X) 


V{X\0) 
DError  [0 1  B\ 


DBias  [0 1  n,  G(0),  A] 


DVar(^|n,  G(0),  A] 


The  non-monotonic  correct  fraction  of  discriminator  output  space  (see 
definition  5.14). 

The  CFM  objective  function. 

The CFM generated by the training sample S^n, given the discriminator
𝒢(X | θ) with parameterization θ.

The CFM generated by the training sample S^n, given the discriminator
𝒢(X | θ) with parameterization θ[k] at learning iteration k.

The  Vapnik-Chervonenkis  (VC)  dimension  [137,  136],  a  measure  of 
classifier  complexity  (see  section  3.5). 

The discriminant differential (see definition 2.7). Note: the somewhat
smaller notation δ(·) denotes the Dirac delta function (e.g., [80, pg.
266]); the use is made clear in the text.

Given  the  example/class  label  pair  {X^  ,  W^).  Sr  =  )>  —  y7  is  the 
discriminant  differential  associated  with  the  class  W'^  of  the  example  X^ 
(see  section  2.2.4). 

The  reject  threshold  value  of  the  discriminant  differential:  if  a  test  example 
generates  a  discriminant  differential  less  than  this  value,  the  classification 
is  rejected  as  invalid. 

The  learned  threshold  value  of  the  discriminant  differential:  if  a  training 
example  generates  a  discriminant  differential  greater  than  or  equal  to  this 
value,  the  example  has  been  learned. 

The discriminant differential associated with the ith discriminant function
g_i(X | θ) (see definition 2.7).

The discriminant differential associated with the ith discriminant function
g_i(X | θ*) (see definition 2.7). This notation indicates that the discriminator's
parameterization θ* is optimal to the extent that it maximizes the
CFM objective function.

The high-state target value associated with an error measure objective
function (see section 2.3).

The  low-state  target  value  associated  with  an  error  measure  objective 
function  (see  section  2.3). 

A  classifier  of  the  random  vector  X. 

A Bayes-optimal classifier of X (see definition 2.1).

The  efficient  classifier  of  X  (see  definition  3.12). 

The  relatively  efficient  classifier  of  X  (see  definition  3.15),  that  is,  the 
classifier  that  exhibits  the  lowest  MSDE  allowed  by  the  hypothesis  class 
from  which  it  is  generated. 

The  class  label  (or  classification)  assigned  to  the  feature  vector  X  by  the 
classifier. 

The discriminant error of the classifier generated from the hypothesis class
G(Θ) by the learning strategy A, given a training sample size of n (see
definition 3.6).

The discriminant bias of the classifier generated from the hypothesis class
G(Θ) by the learning strategy A, given a training sample size of n (see
definition 3.7).

The discriminant variance of the classifier generated from the hypothesis
class G(Θ) by the learning strategy A, given a training sample size of
n (see definition 3.8).

DBias[0|  {/I . ,n(^:}.G(0), A] 

The  estimated  discriminant  bias  of  the  classifier  repeatedly  generated  from 
the  hypothesis  class  G{0)  by  the  learning  strategy  A,  given  K  training 
samples  of  sizes  {ni . . . .  ,  /i/r}. 

DVar[^|  {/I . f/jf},G(0),  A] 

The estimated discriminant variance of the classifier repeatedly generated
from  the  hypothesis  class  G(0)  by  the  learning  strategy  A,  given  K 
training  samples  of  sizes  {ni . Wir}. 

<An'|x(^i  I X)  'The  a  posteriori  differential  of  class  ,  given  the  feature  vector  X  (see 

definition  2.5). 

EM  The general error measure objective function.

EM(S^n | θ)  The EM generated by the training sample S^n, given the discriminator
𝒢(X | θ) with parameterization θ.

E_X[f(X)]  The expectation of the function f(X), taken over the domain of the random
vector (or variable) X.

^(X)  A  discriminant  function  (more  precisely,  a  set  of  C*  discriminant  functions) 

for  X. 

The  Bayesian  discriminant  function  (BDF)  of  X  (in  any  of  its  forms  — 
see  definition  2.2). 

^(X)&n«.f-Pmtufcr7i.tric  A  probabilistic  form  of  the  BDF  (definition  2.4). 

Pwhihilistic  A  strictly  probabilistic  form  of  the  BDF  (definition  2.3). 

^{^)Bayr^  Differtmi<ti  A  differential  form  of  the  BDF  (definition  2.6). 

A  Strictly  differential  form  of  the  BDF  (definition  2.5). 

Fbo,w  The  set  of  all  BDFs  of  X . 

^Banf-Pwhahiiimc  The  Set  of  all  probabilistic  forms  of  the  BDF  of  X . 

^ Bmr.t-Sinctiy  PmhaHUmk  The  Set  of  all  Strictly  probabilistic  forms  of  the  BDF  of  X . 

^ Boyts  iiigtnmiai  The  sct  of  all  differential  forms  of  the  BDF  of  X . 

^ Bayet-Siriaiy  Diffenntiai  The  Set  of  all  Strictly  differential  forms  of  the  BDF  of  X . 

#  The  general  objective  function  (or  empirical  risk  measure). 

g_i(X | θ)  The classifier's discriminant function for class ω_i; the parameterization
of the over-all discriminator is θ.

g_i(X | θ*)  The classifier's discriminant function for class ω_i; the parameterization
of the over-all discriminator is θ*, which is optimal by some objective
function.

𝒢(X | θ)  The classifier's discriminator (i.e., the set of C discriminant functions);
the discriminator's parameterization is θ.

𝒢(X | θ*)  The classifier's discriminator (i.e., the set of C discriminant functions);
the discriminator's parameterization is θ*, which is optimal by some
objective function.

Q(X  1 0)B(ntf  A  Bayesian  discriminant  function  (BDF)  of  X  contained  in  the  hypothesis 

class  G(0). 

(?(X  1 0)&n«./>n*oW«mc  A  probabilistic  form  of  the  BDF  of  X  contained  in  the  hypothesis  class 

G(0). 

Q(X  1 0)8ayes  StriciiyPn*ahiiifiic  A  Strictly  probabilistic  form  of  the  BDF  of  X  contained  in  the  hypothesis 

class  G(0). 

0(X\0)Baye!t  Differfiuiai  A  differential  form  of  the  BDF  of  X  contained  in  the  hypothesis  class 

G(0). 


Q{X\0)b„.„  -Strictly  Differential 

G 

G(0) 

% 

Gm) 


G(0,X)p„^, 

G(0)s(n« 

G{  0  )Baye.i-Prtihahilistic 

G(0)tfm«  -Stricfly  PmhahiUxtic 

G(0)Bmt.,  ■Differential 

G{0)8flv«  ■Siricih  Differtniial 

G{0i)B<nt.i 


G  (  0^  )  Bayef-PrtihahitiMir 

G(04-)s<nM-Sfrirt/v  PrMhiUnic 

G(04-)8m«  •Differtniial 
G(0i)BoyM  -Strictly  Differential 

T[] 

T^l] 

Hz  (/(Z)) 

iff 

1 

I 


A strictly differential form of the BDF of X contained in the hypothesis
class G(0).

The  functional  basis  of  the  hypothesis  class  G(0). 

The  hypothesis  class  with  functional  basis  G  and  parameter  space  0. 
The  set  of  all  hypothesis  classes. 

The  hypothesis  class  (in  the  set  of  all  hypothesis  classes)  with  the  minimum 
functional  complexity  necessary  to  perform  a  particular  pattern  recognition 
task  with  a  specified  level  of  generalization.  The  generalization  of  a 
classifier  generated  from  G(0i)  with  a  training  sample  size  of  n  is 
measured  in  terms  of  its  mean-squared  discriminant  error  (MSDE  —  see 
definition  3.9). 

The  proper  parametric  model  of  X  (see  definition  3.13). 

The  set  of  all  BDFs  of  X  in  the  hypothesis  class  G(0). 

The  set  of  all  probabilistic  forms  of  the  BDF  of  X  in  the  hypothesis  class 

G(0). 

The  set  of  all  strictly  probabilistic  forms  of  the  BDF  of  X  in  the  hypothesis 
class  G(0). 

The  set  of  all  differential  forms  of  the  BDF  of  X  in  the  hypothesis  class 

G(0). 

The  set  of  all  strictly  differential  forms  of  the  BDF  of  X  in  the  hypothesis 
class  G(0). 

The  set  of  all  BDFs  of  X  in  the  minimum-complexity  hypothesis  class 
G(0)  (i.e.,  the  hypothesis  class  with  the  least  functional  complexity 
necessary  for  Bayesian  discrimination). 

The  set  of  all  probabilistic  forms  of  the  BDF  of  X  in  the  minimum- 
complexity  hypothesis  class  G(0). 

The  set  of  all  strictly  probabilistic  forms  of  the  BDF  of  X  in  the 
minimum-complexity  hypothesis  class  G(0). 

The  set  of  all  differential  forms  of  the  BDF  of  X  in  the  minimum- 
complexity  hypothesis  class  G(0). 

The  set  of  all  strictly  differential  forms  of  the  BDF  of  X  in  the  minimum- 
complexity  hypothesis  class  G(0). 

The general classifier functional complexity measure of section 3.5,
page 74.

The  upper  bound  on  T  [  •  ]  for  a  particular  choice  of  hypothesis  class  (see 
section  3.5). 

The  Hessian  (i.e.,  the  matrix  of  second-order  derivatives)  of  /(Z)  with 
respect  to  the  vector  Z. 

Read,  “. . .  if  and  only  if ...  ” 

The  identity  matrix. 

The  identity  vector. 


XT 

XT mono 
^XTnmno(C) 


The incorrect fraction of discriminator output space (see definition 5.13).
The monotonic incorrect fraction of discriminator output space (see definition 5.17).

The  monotonic  incorrect  fraction  of  discriminator  output  space  associated 
with  the  objective  function  $ ,  given  a  C-dimensional  discriminator  output 
space  (i.e.,  a  C-class  learning  task). 
Symbol


Meaning 


XT 


L() 


IL 

A 

Aa 


Ap 


Ap-MSE 

Ap-CE 

Ap-LMS 


Ap-ML 

MAE 


MAE  (5"  I  e) 


MT 

*MT(C) 


MSDE[P|«.G(0),A] 


The non-monotonic incorrect fraction of discriminator output space (see
definition 5.16).

A  log-likelihood  function. 

The  set  of  all  learning  strategies. 

The  general  learning  strategy. 

The  differential  learning  strategy  (associated  with  the  CFM  objective 
function). 

The  probabilistic  learning  strategy  (associated  with  error  measure  objective 
functions). 

Probabilistic  learning  via  the  MSE  objective  function. 

Probabilistic  learning  via  the  CE  objective  function. 

Probabilistic  learning  via  the  LMS  objective  function  (this  is  identical  to 
learning  via  the  MSE  objective  function). 

Probabilistic  learning  via  the  method  of  maximum-likelihood. 

The mean absolute error (MAE) objective function (also known as “least
absolute error” and “least absolute deviation”).

The MAE generated by the training sample $S^n$, given the discriminator
$\mathcal{G}(X \mid \theta)$ with parameterization $\theta$.

The monotonic fraction of discriminator output space (see definition 5.18).
The  monotonic  fraction  of  discriminator  output  space  associated  with  the 
objective  function  $ ,  given  a  C-dimensional  discriminator  output  space 
(i.e.,  a  C-class  learning  task). 

The  mean-squared  discriminant  error  (MSDE)  of  the  classifier  generated 
from  the  hypothesis  class  G(0)  by  the  learning  strategy  A ,  given  a 
training  sample  size  of  n  (see  definition  3.9). 


MSDE  I  {n, . njr},G(0),A] 


MSE 

MSE  (5"|») 


w. 


il 


The estimated MSDE of the classifier repeatedly generated from the hypothesis
class G(0) by the learning strategy A, given K training
samples of sizes $\{n_1, \ldots, n_K\}$.

The  mean-squared  error  (MSE)  objective  function. 

The MSE generated by the training sample $S^n$, given the discriminator
$\mathcal{G}(X \mid \theta)$ with parameterization $\theta$.

The mean of a Gaussian-distributed random vector. The notation $\mu_i$
generally refers to the mean of the random vector’s ith class-conditional
Gaussian probability density function.

The number of examples of the pattern $X_p$ in a training sample of size n.
The number of examples of the pattern $X_p$ representing class $\omega_i$ in a
training sample of size n.

A  random  noise  variable. 

A  random  noise  vector. 

The ith class (i.e., concept) that a random feature vector can represent.
Read, “Not $\omega_i$.”

The classification assigned to X by the Bayes-optimal classifier (definition 2.1,
page 17).

The  domain  of  the  class  label  W’.  Sometimes  called  classification 
(or  class  label)  space,  that  is,  the  set  of  all  class  labels  with  which  a 
feature  vector  can  be  paired.  For  the  C-class  pattern  recognition  task. 


W  sU  =  {UJ . U)c). 


Symbol  Meaning 


$P_z(\zeta)$

The probability that the random variable (or vector) z will take on the
value $\zeta$. This notation is equivalent to the notation $P(z = \zeta)$.

$\hat{P}_z(\zeta)$

An estimate of the probability that the random variable (or vector) z will
take on the value $\zeta$.

The prior probability of class $\omega_i$.

$P_{W|X}(\omega_i \mid X)$

The a posteriori probability of class $\omega_i$, given the feature vector X.

The error rate (i.e., probability of error) for the classifier with the
discriminator $\mathcal{G}(X \mid \theta)$ (see definition 3.1).

Pr{G\e,fi) 

An estimate of the error rate for the classifier with the discriminator
$\mathcal{G}(X \mid \theta)$; the estimate is based on a test sample size of $\eta$
(see definition 8.1).

P,  {TBayrs) 

The  error  rate  exhibited  by  the  Bayes-optimal  classifier  of  X  (see 
definition  3.2). 

Pe 

An estimate of the error rate exhibited by the Bayes-optimal classifier of X.

R 

A  rotation  matrix. 

RE 

Estimated  relative  efficiency  (see  definition  8.6). 

R 

The  set  of  all  real  numbers. 

$p_X(X)$

The probability density function (pdf) of the feature vector X.

Note: When written $p_X(X)$, the notation is meant to convey, “the pdf of X evaluated at some
arbitrary value of X;” when written $p_X(Z)$, the notation is meant to convey, “the pdf of X
evaluated at X = Z.” We use this same notational convention for other probability measures of X.


/^x|»v(^  l^») 

s.l. 

SNR 

S” 


cr[S,tl'] 


S 

T 

r 

T{{Xi,Wi)) 


The ith class-conditional pdf of X, that is, the pdf of X when it represents
class $\omega_i$.

The joint probability density of X and class $\omega_i$.

Read, “... such that ...”

Signal-to-noise  ratio. 

The training sample of size n, that is, the set of n randomly-drawn
example/class label pairs $\{(X^1, W^1), \ldots, (X^n, W^n)\}$ used to generate
the classifier from its hypothesis class.

The CFM generated by a discriminant differential of $\delta$ when the CFM
confidence parameter is $\psi$. Note: the somewhat smaller notation $\sigma^2$
denotes the variance parameter of a Gaussian-distributed random variable;
the use is made clear in the text.

A covariance matrix. The notation $\Sigma_i$ generally refers to the covariance
matrix of the random vector’s ith class-conditional probability density
function.

Denotes the transpose of a vector (e.g., $X^T$).

A target function: when $W^j = \omega_i$, $\tau_i((X^j, W^j)) = 1$; otherwise
$\tau_i((X^j, W^j)) = 0$.

A target vector for the discriminator $\mathcal{G}(X \mid \theta)$, used when learning
probabilistically via an error measure objective function.

The target vector for the discriminator $\mathcal{G}(X^j \mid \theta)$, used when learning
probabilistically via an error measure objective function. The target class
is indicated by $W^j$.


Symbol 


Meaning 


0 

9 

eo 

e 


0' 


0{X) 

U(.r) 

U+(x) 

Var[.T] 

W 

W’> 

w; 


X 

X 

Xi 

<XMV>) 

X, 

X/l 


<x/\w;) 

E(n) 

H(»/) 

>1 

)V 


Y 


^rorrret 
Y  inrnmct 


y 


A  parameter  associated  with  the  classifier’s  discriminator. 

The parameter vector of the classifier’s discriminator.

The  discriminator’s  initial  parameterization  (i.e.,  its  parameterization  prior 
to  learning). 

The domain of the parameter vector $\theta$. Sometimes called parameter space,
that is, the set of all parameter vectors that the discriminator can have:
$\theta \in \Theta$.

A discriminator’s parameter vector that is optimal by some objective
function.

The jth element of the optimal parameter vector $\theta^*$.

Refer  to  (2.93),  page  46. 

Refer  to  (2.93),  page  46. 

The Heaviside step function of x (e.g., [80, pg. 258]).

The modified Heaviside step function, which is equal to 1 for all x > 0,
0 for all x < 0, and ½ for x = 0.

The variance (i.e., second central moment) of the random variable x.

The stochastic class label associated with the feature vector X.

The class label associated with the jth example of X.

The class label associated with the jth example of the pattern $X_p$. Note
that $W_p^j$ implies a specific value of X (i.e., it implies the pattern $X_p$);
$W^j$ does not.

The domain of the feature vector X. Sometimes called feature vector
space, that is, the set of all possible feature vectors such that $X \in \mathcal{X}$.
The feature vector (or attribute vector).

The jth example of X, that is, the jth realization of the random feature
vector X.

The jth example of X, along with its class label.

A particular pattern (i.e., a particular value of X).

The jth example of $X_p$, that is, the jth realization of the random feature
vector X having the value $X_p$. Note that $X_p^j$ implies a specific value of
X (i.e., it implies the pattern $X_p$); $X^j$ does not.

The jth example of $X_p$, along with its class label.

The number of misclassified examples in $S^n$, the training sample of size
n.

The number of misclassified examples in the test sample of size $\eta$.
Short-hand notation for the ith discriminator output $g_i(X \mid \theta)$.

Given the example/class label pair $(X^j, W^j)$, $y_\tau$ is the discriminator
output associated with the class of the example $X^j$ (see section 2.2.4).

Given the example/class label pair $(X^j, W^j)$, $y_{\bar{\tau}}$ is the largest
discriminator output not associated with the class $W^j$ of the example $X^j$ (see
section 2.2.4).

Short-hand notation for the output state of the classifier’s discriminator
$\mathcal{G}(X \mid \theta)$.

The “correct” vertex of discriminator output space (see (5.3)).

The “incorrect” vertex of discriminator output space (see (5.2)).

The domain of the discriminator’s output Y. Sometimes called discriminator
output space (see section 2.2.1). Thus, $Y \in \mathcal{Y}$.
Symbol


Meaning 


$\mathcal{Y}_{\text{correct}}$  The correct region (or side) of discriminator output space (see definition 5.8).

$\mathcal{Y}_{\text{incorrect}}$  The incorrect region (or side) of discriminator output space (see definition 5.6).

$\psi$  The confidence parameter for the classification figure-of-merit (CFM
objective function).

$\mathbb{Z}^+$  The set of all positive integers (i.e., all positive natural numbers).


Appendix  B 

Notes  on  Convergence 


The proofs of chapters 2 and 3 rely on notions of convergence that require some explanation.
Ideally, we would expect that the statistics of a training sample $S^n = \{(X^1, W^1), \ldots, (X^n, W^n)\}$
reflect the true nature of the random feature vector X (i.e., $p_X(X)$, $\{P_W(\omega_1), \ldots, P_W(\omega_C)\}$, and
$\{P_{W|X}(\omega_1 \mid X), \ldots, P_{W|X}(\omega_C \mid X)\}$) in the limit that the training sample size grows infinitely large (i.e.,
$n \to \infty$). Moreover, we would expect that this convergence of the empirical probability measures to the
true measures would, for each and every asymptotically large training sample, occur with certainty (i.e.,
convergence with probability one) and uniformly over all feature vector space $\mathcal{X}$ (i.e., convergence at some
non-zero rate would be guaranteed for all $X \in \mathcal{X}$ at which $p_X(X)$ is defined).

In fact, the empirical cumulative distribution function (cdf) of the arbitrary random variable x does, in
general, uniformly converge to the true underlying cdf with probability one. The Glivenko-Cantelli Theorem
(e.g., [105, sec. II.3]) proves this expression of the uniform strong law of large numbers.¹ Vapnik describes
an extension of the theorem to the general N-dimensional random feature vector X in [136, ch. 6]; we
assume that X exhibits uniform convergence with probability one accordingly.
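The uniform convergence described above can be illustrated numerically. The following is a minimal sketch, assuming a uniform random variable on [0, 1] as the test case; the function names, the seed, and the sample sizes are ours and do not come from the cited references. It measures the largest gap between the empirical cdf and the true cdf for a small and a large sample.

```python
import random


def sup_ecdf_distance(sample, cdf):
    """Return sup_x |F_n(x) - F(x)|, where F_n is the empirical cdf of `sample`."""
    xs = sorted(sample)
    n = len(xs)
    worst = 0.0
    for k, x in enumerate(xs):
        fx = cdf(x)
        # F_n jumps from k/n to (k+1)/n at x, so check both sides of the jump.
        worst = max(worst, abs((k + 1) / n - fx), abs(k / n - fx))
    return worst


def uniform_cdf(x):
    """True cdf of a random variable distributed uniformly on [0, 1]."""
    return min(1.0, max(0.0, x))


rng = random.Random(0)  # fixed seed (our choice) so that the run is repeatable
distances = {
    n: sup_ecdf_distance([rng.random() for _ in range(n)], uniform_cdf)
    for n in (100, 10_000)
}
```

The entry of `distances` for the larger sample is expected to be much smaller than the entry for the smaller one, which is the behaviour that the Glivenko-Cantelli Theorem guarantees with probability one as n grows.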


¹ See [28, sec. 9.6] for a concise, readable summary of the Glivenko-Cantelli Theorem.


Appendix  C 


The  Box  Plot  Statistical  Summary 


The box plot is a non-parametric statistical summary developed by John W. Tukey [131, ch. 2].¹ Given a
sample $S^n = \{x_1, \ldots, x_n\}$ of the random variable x, the box plot is a concise graphical summary of the
empirical low-order moments of x, one that makes no assumptions about the probability density function
(pdf) of x.


C.1  How  to  Read  a  Box  Plot 

Figure C.1 shows an annotated box plot for a hypothetical sample $S^n$ of x. Note that all the examples of
$S^n$ fall well within the range of 30 < x < 100. The box plot is formed by sorting all the examples and
dividing them into “quartiles” (i.e., into four groups, each of which represents 25% of $S^n$). The box itself
encompasses the middle 50% of $S^n$. The top 25% of $S^n$ is depicted by the vertical line extending above
the box, and the bottom 25% of $S^n$ is depicted by the vertical line extending below the box. The box is
divided by a horizontal line at the median of $S^n$. The inner and (if shown) outer “T”-shaped “fences” of
each plot depict the nominal lower bound of the first quartile and nominal upper bound of the fourth quartile.
Any extreme first/fourth quartile values falling beyond the outer fence(s) are plotted as dots. The box plot
therefore displays all of the data, emphasizing the median and a quartile partitioning of the sample.

The box plot has a number of advantages as a statistical graphic:

•  It is simple to compute and display.

•  It makes no assumption about the pdf $p_x(x)$ of x.

•  It is generally a more meaningful graphic than alternatives such as histograms, whisker plots, etc. when
the sample size (i.e., n) is small [131, ch. 2].

•  One can easily infer the low order central/non-central moments of x from the box plot.

¹ Tukey is well-known for the Cooley-Tukey fast Fourier transform (FFT) algorithm and his work with Blackman in spectral
estimation.
Figure C.1: A box plot for a sample of the random variable x.


From figure C.1 we can see that the median value for $S^n$ is 75. The middle 50% of the sample is fairly
tightly concentrated about the median on the interval [70, 80]. The bottom 25% of the sample (i.e., the first
quartile) spans the interval [37, 70], and there is an extreme statistical outlier at 37. In contrast, the top 25%
(i.e., the fourth quartile) is more tightly distributed on the interval [80, 90]. Thus, the box plot indicates that
the empirical distribution of x is skewed towards higher values of x. It should be clear from the figure
that the box plot gives the observer a concise non-parametric sketch of the median, variance, skewness, and
kurtosis of x. Specifically, the height of the box and the length of its fences are an indication of the variance
in the classifier’s error rate over all trials; the symmetry of the box plot (or lack thereof) is an indication of
the skewness; and the height of the box in comparison to the length of the fences is an indication of kurtosis
(i.e., how abruptly the sample peaks about its median).

The computations by which the box plot is constructed from a data sample are detailed in [131, ch. 2].
We provide a summary of them in the following section. We emphasize that it is not necessary to understand
the following material in order to interpret a box plot; we provide it for the convenience of those who wish
to know precisely how the box plot fences are constructed.
C.2  How to Construct a Box Plot²

The first step in constructing a box plot is to sort the sample $S^n$. From this sorted version of $S^n$ (which we
will denote by $\tilde{S}^n$) we develop a “5-number summary”: it comprises the lower extreme value (low), the
first quartile boundary (Q1), the median (med), the third quartile boundary (Q3), and the upper extreme value
(high) of the sample. If $x_{(i)}$ denotes the $(n - i + 1)$th ranked example of $\tilde{S}^n$ (i.e., $x_{(1)}$ denotes the lower
extreme example and $x_{(n)}$ denotes the upper extreme example) then the indices of the examples that we use
to compute the five number summary are obtained from the following five real numbers (note that the $\lfloor z \rfloor$
operator returns the largest integer not greater than z, and the $\lceil z \rceil$ operator returns the smallest integer not
less than z):


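$$
\begin{aligned}
i[\text{low}] &= 1 \\
i[\text{med}] &= \frac{n + 1}{2} \\
i[\text{high}] &= n \\
i[\text{Q1}] &= \tfrac{1}{2} \cdot \left[ \lfloor i[\text{med}] \rfloor + 1 \right] \\
i[\text{Q3}] &= n + 1 - i[\text{Q1}]
\end{aligned}
$$

The resulting five number summary is given by

$$
\begin{aligned}
\text{low} &= f(i[\text{low}]) \\
\text{med} &= f(i[\text{med}]) \\
\text{high} &= f(i[\text{high}]) \\
\text{Q1} &= f(i[\text{Q1}]) \\
\text{Q3} &= f(i[\text{Q3}])
\end{aligned}
$$

where $\mathbb{Z}^+$ denotes the set of all positive integers, and

$$
f(i[\text{number}]) =
\begin{cases}
x_{(i[\text{number}])}, & i[\text{number}] \in \mathbb{Z}^+ \\
\tfrac{1}{2} \cdot \left[ x_{(\lfloor i[\text{number}] \rfloor)} + x_{(\lceil i[\text{number}] \rceil)} \right], & \text{otherwise}
\end{cases}
$$

Table C.1 lists the indices of the sorted RVs that we would use to compute the five number summary for
various sample sizes (i.e., various ns).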

² Adapted from the original by Tukey [131, ch. 2].


          Indices of x_(i)
  n    low    Q1      med     Q3       high
  3    1      1       2       3        3
  4    1      1,2     2,3     3,4      4
  5    1      2       3       4        5
  6    1      2       3,4     5        6
  7    1      2,3     4       5,6      7
  8    1      2,3     4,5     6,7      8
  9    1      3       5       7        9
 10    1      3       5,6     8        10
 11    1      3,4     6       8,9      11
 12    1      3,4     6,7     9,10     12
 13    1      4       7       10       13
 14    1      4       7,8     11       14
 15    1      4,5     8       11,12    15

Table C.1: A listing of indices i for $x_{(i)}$ used to compute box plot 5-number summaries of $S^n$ for various
sample sizes (i.e., various ns).
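The index rules above can be sketched in a few lines of code. This is a minimal illustration, not the author's implementation: the function name is hypothetical, the list `xs` stands in for the sorted sample with `xs[i - 1]` playing the role of $x_{(i)}$, and a non-integer index is assumed to mean the average of its two neighbouring examples.

```python
import math


def five_number_summary(sample):
    """Return low, Q1, med, Q3 and high using the index rules of this section."""
    xs = sorted(sample)  # xs[i - 1] plays the role of x_(i)
    n = len(xs)

    def f(i):
        # For an integer index, floor and ceil coincide and the example itself
        # is returned; otherwise the two neighbouring examples are averaged.
        lo, hi = math.floor(i), math.ceil(i)
        return (xs[lo - 1] + xs[hi - 1]) / 2

    i_med = (n + 1) / 2
    i_q1 = (math.floor(i_med) + 1) / 2
    i_q3 = n + 1 - i_q1
    return {
        "low": f(1),
        "Q1": f(i_q1),
        "med": f(i_med),
        "Q3": f(i_q3),
        "high": f(n),
    }
```

Passing the integers 1 through n as the sample makes each returned value equal to the (possibly fractional) index used, which gives a convenient way to compare the sketch with a row of the table: for n = 7 the Q1 entry is 2.5, the midpoint of indices 2 and 3.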


Once we have computed the 5-number summary we have most of the box plot built (i.e., we know the
location of the box and its median, as well as the upper and lower extreme values). The only remaining
computations are those for the locations of the adjacent and outer fences for the first and fourth quartiles
of $S^n$. In short, the adjacent fence locations are displaced from their quartile boundary by no more than
1.5 times the distance between the first and third quartiles. Likewise, the outer fence locations are displaced
from their quartile boundary by no more than 3 times the distance between the first and third quartiles. The
displacement of a fence from its respective quartile boundary never exceeds the location of the extreme
example in the fence’s quartile. This explains why some box plots appear to have missing fences (as is the
case for the fourth quartile data of figure C.1); in reality the fences coincide with other fences or the quartile
boundaries, so they do not appear in the plot.

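Quantitatively, the first quartile fences are given by

$$
\text{lower adjacent fence} =
\begin{cases}
\text{Q1} - 1.5\,[\text{Q3} - \text{Q1}], & x_{(1)} < \text{Q1} - 1.5\,[\text{Q3} - \text{Q1}] \\
x_{(1)}, & \text{otherwise}
\end{cases}
$$

$$
\text{lower outer fence} =
\begin{cases}
\text{Q1} - 3.0\,[\text{Q3} - \text{Q1}], & x_{(1)} < \text{Q1} - 3.0\,[\text{Q3} - \text{Q1}] \\
x_{(1)}, & \text{otherwise}
\end{cases}
$$

The fourth quartile fences are given by

$$
\text{upper adjacent fence} =
\begin{cases}
\text{Q3} + 1.5\,[\text{Q3} - \text{Q1}], & x_{(n)} > \text{Q3} + 1.5\,[\text{Q3} - \text{Q1}] \\
x_{(n)}, & \text{otherwise}
\end{cases}
$$

$$
\text{upper outer fence} =
\begin{cases}
\text{Q3} + 3.0\,[\text{Q3} - \text{Q1}], & x_{(n)} > \text{Q3} + 3.0\,[\text{Q3} - \text{Q1}] \\
x_{(n)}, & \text{otherwise}
\end{cases}
\tag{C.1}
$$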


For the case in which x is normally distributed (i.e., $x \sim N(\mu, \sigma^2)$), the adjacent and outer fences are
displaced approximately two and four standard deviations from the quartile boundary, respectively.³
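The figures of two and four standard deviations can be checked directly. The sketch below assumes a standard normal variable (so that displacements are already in units of the standard deviation) and uses Python's standard-library `statistics.NormalDist`; the variable names are ours.

```python
from statistics import NormalDist

# Third-quartile boundary of N(0, 1); by symmetry the first quartile is its negative.
z_q3 = NormalDist().inv_cdf(0.75)

# Distance between the first and third quartiles, in standard deviations.
interquartile_distance = 2 * z_q3

# Maximum fence displacements from the quartile boundary.
adjacent_displacement = 1.5 * interquartile_distance
outer_displacement = 3.0 * interquartile_distance
```

The two displacements come out slightly above 2 and 4 standard deviations, consistent with the statement above for a large sample.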

Again,  all  examples  in  the  sample  falling  outside  the  upper  and  lower  outer  fences  are  extreme  outliers, 
which  are  represented  by  individual  symbols. 
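The fence rules can be summarized compactly: each fence sits at its nominal displacement from the quartile boundary unless the extreme example lies closer, in which case the fence is placed at the extreme example. The following is a minimal sketch under that reading of the fence equations; the function name and dictionary keys are hypothetical, and the input values are the ones read off figure C.1 in section C.1.

```python
def box_plot_fences(low, q1, q3, high):
    """Return adjacent and outer fence locations from a 5-number summary."""
    spread = q3 - q1  # distance between the first and third quartiles
    return {
        # A lower fence never lies below the lower extreme example.
        "lower_adjacent": max(low, q1 - 1.5 * spread),
        "lower_outer": max(low, q1 - 3.0 * spread),
        # An upper fence never lies above the upper extreme example.
        "upper_adjacent": min(high, q3 + 1.5 * spread),
        "upper_outer": min(high, q3 + 3.0 * spread),
    }


# Values described for figure C.1: low = 37, Q1 = 70, Q3 = 80, high = 90.
fences = box_plot_fences(37, 70, 80, 90)
```

For these inputs the lower outer fence falls at 40, so the example at 37 lies beyond it and is plotted as an outlier, while both upper fences coincide with the upper extreme at 90, which is why the fourth quartile appears to have missing fences.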


³ Assuming that the sample size n is large so that $S^n$ is representative of x.


Appendix  D 

A  Synthetic  Functional  Form  of  the 
Classification  Figure  of  Merit 


This appendix describes in detail the synthetic asymmetric function we employ for the classification
figure-of-merit. Our use of this functional form has a two-fold motivation:

•  Chapter 2 shows that differential learning requires a sigmoidal CFM function with variable “steepness”.
However, the logistic sigmoidal form originally described in [55] is symmetric: when it has a steep
transition region its derivative is essentially zero outside of that region. As described in section D.3,
this leads to very small gradients in the search algorithm used to find the optimal parameterization of
the classifier. This in turn leads to unreasonably slow learning. In order to overcome the problem,
we desire an asymmetric sigmoid that retains a significant non-zero first derivative for yet un-learned
training examples (i.e., those with negative discriminant differentials $\delta$), even for the case in which
the sigmoid is steep in its transition region.

•  We require a mathematically simple synthetic form in order to minimize the number of floating point
computations necessary to evaluate the function and its first and second derivatives.

In section D.1 we specify the synthetic form of the CFM objective function; in section D.2 we analyze
the computational requirements posed by its evaluation and that of its first two derivatives; in section D.3 we
analyze the convergence properties it engenders (i.e., how fast differential learning is, using synthetic CFM),
and on this basis contrast it with the original logistic sigmoidal form of CFM; in section D.4 we derive an
upper bound on the synthetic CFM confidence parameter $\psi$ that guarantees Bayes-optimal discrimination,
as described in section 2.4; in section D.6 we list ANSI-C source code for the synthetic form and its first
two derivatives.

We wish to emphasize that our development of the synthetic CFM objective function was motivated by
palpable deficiencies in the original logistic sigmoidal form. The deficiencies relate primarily to the poor
convergence properties and instability of differential learning via the original form of CFM, which we detail
in section D.3. Some readers might find this appendix, and section D.3 in particular, rather abstract
and pedantic. We encourage such persons to recognize the cause-and-effect relationship here: the problems
associated with differential learning via the original logistic sigmoidal CFM objective function led to the
theory, rather than vice-versa. The details herein were (and remain) a necessary evil on the path to an
implementation of differential learning that works in practice as well as it does in theory.

D.1  Specifications for the Synthetic CFM Objective Function

We create a piece-wise linear sigmoid by connecting three line segments with two arcs (we abuse notation
by referring to these arcs in terms of their radii). This synthesis is depicted in figure D.1. The lower radius
$r_n$ is generated by a circle with a centroid $(\mu_{xn}, \mu_{yn})$, which is constrained to lie on line segment A; the
radius is also constrained to be tangent to line segment B. The upper radius $r_p$ is generated by a circle with a
centroid $(\mu_{xp}, \mu_{yp})$, which is constrained to lie on line segment C; the radius is constrained to be tangent to
the horizontal line of unit height.¹ A line drawn from point $(-1, 0)$ to point $(x_{tnn}, y_{tnn})$ (the latter of which
is tangent to the lower radius) forms the lower “leg” of the sigmoid. A line drawn from point $(x_{tn}, y_{tn})$ to
point $(x_{tp}, y_{tp})$ (points that are tangent to the lower and upper radii, respectively) forms the transition region
of the sigmoid. A line drawn from point $(\mu_{xp}, 1)$ to point $(1, 1)$ (the former of which is tangent to the upper
radius) forms the upper leg of the sigmoid. This upper leg always has a value of one and a slope of zero.
The steepness of the sigmoid is increased by moving the centroids of the two circles toward $\delta = 0$ along
lines A and C. Conversely, the steepness is decreased by moving the centroids of the two circles away from
$\delta = 0$ along lines A and C. Since the lower radius $r_n$ is constrained always to be tangent to line segment B
and the upper radius is constrained always to be tangent to the horizontal line of unit height, the radii are
proportional to their centroids’ horizontal distances from the vertical line at $\delta = 0$. In the limit that these
centroid distances are zero (corresponding to a confidence parameter $\psi$ of zero), $\sigma[\delta, \psi]$ is a Heaviside
step function. In the limit that these centroid distances are their maximum values (corresponding to $\psi = 1$),
$\sigma[\delta, \psi]$ is a nearly linear function of $\delta$ when $\delta < 1$, otherwise it assumes its maximum value of unity.
Figure 2.6 shows $\sigma[\delta, \psi]$ for eight different values of its (single) confidence parameter $\psi$.

Recall from section 2.2.4 that the synthetic CFM objective function must satisfy the following constraints:

1.  The function must have finite lower and upper bounds l and h:

$$
-\infty \ll l \le \sigma[\delta, \psi] \le h \ll \infty \tag{D.1}
$$

¹ The parameters that specify line segments A, B, and C in figure D.1 were chosen by the author using a graphic tool designed for
this express purpose. The qualitative design criteria for the synthetic function were 1) that it retain a significant non-zero slope, even
when its transition region becomes steep, and 2) that the arcs connecting the three line segments of the function have reasonably large
radii for all but very steep transition regions. This latter characteristic ensures relatively small higher-order derivatives for the synthetic
function at the arc segments.


Figure D.1: Details of the synthetic asymmetric sigmoidal form of the classification figure-of-merit (CFM).
This synthetic function is shown for various confidence parameter ($\psi$) values in figure 2.6.

The synthetic function is bounded on $[l = 0, h = 1]$ for $-1 \le \delta \le 1$, so it satisfies this constraint
for classifiers with outputs bounded on [0, 1]. Since any classifier’s output state can be normalized
to the interval [0, 1] by a simple affine transformation, this synthetic function can be used with any
classifier.

2.  The function must be a strictly non-decreasing sigmoidal function of $\delta$:

$$
\frac{d}{d\delta} \sigma[\delta, \psi]
\begin{cases}
> 0, & \text{for small } |\delta| \\
\ge 0, & \text{otherwise}
\end{cases}
\tag{D.2}
$$

Equation (D.8) and figures 2.6 and D.1 confirm that the synthetic function has this property.

3.  The function must have a maximum slope occurring in its transition region. This transition slope
should be inversely proportional to the confidence parameter $\psi$:

$$
\max_{\delta} \frac{d}{d\delta} \sigma[\delta, \psi] \propto \psi^{-1}, \quad \psi \in (0, 1] \tag{D.3}
$$

By inspection of figure D.1, it is clear that $\max_{\delta} \frac{d}{d\delta} \sigma[\delta, \psi] = a_t$. The constraint that $a_t$ be
proportional to $\psi^{-1}$ ensures that the function’s derivative in the transition region is bounded for all
non-zero values of $\psi$. Section D.3 confirms that the synthetic function has this property.

4.  The lower leg of the sigmoidal function must have a positive slope, which should be linearly
proportional to $\psi$:

$$
\min_{\delta < 0} \frac{d}{d\delta} \sigma[\delta, \psi] \propto \psi, \quad \psi \in (0, 1] \tag{D.4}
$$

This constraint ensures that the derivative of the function retains a significant positive value for negative
values of $\delta$, as long as $\psi$ is greater than zero. This, in turn, ensures that gradient-based searches
used to optimize the parameters of the classifier by maximizing CFM do not exhibit exponentially
long convergence times as the steepness of the sigmoidal function’s transition region grows large.
Section D.3 explains this property in more detail and confirms that the synthetic function has it.


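5. The sigmoidal function should span a continuum between an approximately linear function of $\delta$ for
$\psi = 1$ to a step function of $\delta$ for $\psi \to 0^+$:

$$\begin{aligned} \lim_{\psi \to 1} \sigma[\delta,\psi] &\approx a_0\,\delta + b_0, \\ \lim_{\psi \to 0^+} \sigma[\delta,\psi] &= a_1\,U_+(\delta) + b_1 \end{aligned} \qquad \text{(D.5)}$$

where $a_0$, $b_0$, $a_1$, and $b_1$ are constants and

$$U_+(\delta) = \begin{cases} 0, & \delta < 0 \\ 1, & \delta \ge 0 \end{cases} \qquad \text{(D.6)}$$

Equations (D.7) and (D.10) - (D.46) and figures 2.6 and D.1 confirm that the synthetic function has
this property.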
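The affine normalization mentioned under constraint 1 can be sketched as follows. This is a minimal illustration, assuming the classifier's raw outputs are known to lie in a finite range; the function name `normalize_output` and the bounds `lo` and `hi` are hypothetical and do not appear in the original text.

```python
def normalize_output(y, lo, hi):
    """Map a classifier output y from [lo, hi] onto [0, 1] with an affine transform.

    lo and hi are assumed bounds on the raw output (hypothetical names).
    """
    if hi <= lo:
        raise ValueError("hi must exceed lo")
    return (y - lo) / (hi - lo)


# Example: a tanh-style output bounded on [-1, 1]
outputs = [-1.0, 0.0, 0.5, 1.0]
normalized = [normalize_output(y, -1.0, 1.0) for y in outputs]
```

Once every output lies in [0, 1], the difference between any two outputs lies in [-1, 1], which is the domain over which the synthetic function is defined.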

Section D.6 lists the source code that implements this synthetic form of the CFM objective function. The
precise mathematical expressions for $\sigma[\delta,\psi]$ and its first two derivatives are


f  <(<5+  I), 


<7  [6,0]  =  I 


+  b;. 


flyp 


+  yjrl  - 


I 


-1  <  ^  < 

Xfnn  <  6  <  Xtn 
Xm  ^  ^ip 

Xo,  <  S  < 

6  >  Pn> 


(D.7) 
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' 

>0 

i  ^  ^  ^  Xtnn 

[»■«  -  -  Pm)  , 

Xtnn  ^  ^  Xfn 

>0  >0 

'1  = 

-S’ 

X,n  <  ^  <  X,p 

(D.8) 

>0 

-  [tj  -  {S  -  -  /(,p) . 

<0  <0 

X,p  <  S  <  p,rp 

.  0. 

^  ^  ftjp 

’  0. 

-1  <  5  <  jr,„„ 

-  (S  - 

■[ 

(•5  -  /'m)'  ■  K  ~  ”  f'l.)']”'  +  *]  ■ 

Xm„  <  S  <  Xp, 

II 

0. 

. 

Xm  <  ^  <  X,p 

(D.9) 

•1 

(5  -  Pxp)^  •  [r^p  ~  0  -  Ptp)^]~'  +  l]  . 

X,p  <  S  <  p^ 

.  0, 

^  >  i'xp 

Each time $\psi$ is changed, the following computations (listed in regressive order) must be performed to
update the synthetic function (all angles are in radians):

K 

—  yip  ®p  ^ip 

(D.IO) 

Xtp  =  /V  +  rpCos(Zi) 

(D.ll) 

yip  =  /bp  +  O' sin(Zi) 

(D.12) 

Xm  =  Pm  -  r„COS{lf) 

(D.13) 

L\ 

tt 

=  2-*-^'’ 

(D.14) 
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\l>xr  -  f'wj  \  J 


0\  —  \/(/*V’  /*ro)^  (/*'7>  /*.'■ 


=  taii(/3) 


Xm„  =  D^cos(l^)  -  I 


Ly  —  tan 


-I  / 

V/'«n  +  • 


)  ■  (£) 


Di  =  -  '’t 


L>n  —  \l(l-*xn  +  1)^  /^yn 


r„  =  R„  ■  V’ 


/'™  =  -<.„COS(l4) 


=  C  sin(Z4) 


Oo 


=  \  -  r„ 


=  •  0 


(D.I5) 


(D.I6) 


(D.17) 


(D.18) 


(D.19) 


(D.20) 


(D.21) 


(D.22) 


(D.23) 


(D.24) 


(D.25) 


(D.26) 


(D.27) 


(D.28) 


(D.29) 
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The  following  quantities  are  constants  (all  angles  are  in  radians): 


^4 

— 

tan-'  (  ) 

(D.30) 

n„ 

+  fU 

(D.31) 

P.mO 

= 

Xi  -  R„  cos(^  -  u) 

(D.32) 

= 

y\  +  R„  sin{^  -  Z5) 

(D.33) 

Xi 

= 

jtq  +  f-sin(Z5) 

(D.34) 

.Vl 

= 

yo  +  Lcosils) 

(D.35) 

•To 

02 

(D.36) 

Ol  -  02 

.VO 

O1O2 

(D.37) 

Ol  -  02 

L 

R. 

(D.38) 

tan  Z7 

^5 

= 

tan-'{a2) 

(D.39) 

^6 

= 

tan-'{ai) 

(D.40) 

^7 

TT  lan-'(ai)  -  tan~'(a2) 

2  2 

(D.41) 

Rn 

= 

0.7  (i.e..r„|  V’  =  1) 

(D.42) 

02 

= 

0.5 

(D.43) 

O] 

= 

5.0 

(D.44) 

Rr 

-flo  =  0.5  (i.e.,  TpIV'  =  1) 

(D.45) 
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ao  =  -0.5  (D.46) 

D.2 The Computational Cost of the Synthetic CFM Objective Function

Since the steepness of the sigmoid is adjusted infrequently,^ evaluation of this synthetic function and its first
two derivatives involves few floating point computations, as indicated by (D.7) - (D.9). The function is
evaluated by comparing its argument with the intervals on $\delta$ corresponding to the three line segments and
two radii. In the case that the argument corresponds to a line segment, the function evaluation requires one
multiplication and one addition, and its derivative evaluations require a constant look-up. In the case that the
argument corresponds to a radius, the function evaluation requires one multiplication, three additions, and
one square root computation; its first derivative's evaluation requires two multiplications, two additions, and
one square root computation; its second derivative's evaluation requires four multiplications, three additions,
and one square root computation. Thus, the computational cost of evaluating this synthetic function and its
first two derivatives is comparable to the cost of evaluating the logistic sigmoidal form of CFM [55] (see
section D.3.1) and its first two derivatives.
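The operation counts for the function evaluation can be illustrated with a small sketch. It is not the source code of section D.6; it only assumes that each linear piece has the form a*delta + b and that each radius is a circular arc with centre (cx, cy) and a precomputed squared radius r_sq. All function and parameter names here are hypothetical.

```python
import math


def eval_line(delta, a, b):
    """Linear piece: one multiplication and one addition."""
    return a * delta + b


def eval_arc(delta, cx, cy, r_sq, upper=True):
    """Circular-arc piece with centre (cx, cy) and squared radius r_sq.

    Costs one multiplication (the square), three additions/subtractions,
    and one square root, matching the count quoted in the text.
    """
    dx = delta - cx                   # addition 1
    root = math.sqrt(r_sq - dx * dx)  # multiplication, addition 2, square root
    return cy + root if upper else cy - root  # addition 3
```

Because the squared radius and the centre are recomputed only when the confidence parameter changes, they can be treated as cached constants during evaluation.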

D.3 The Convergence Properties of Differential Learning via the CFM Objective Function

As described by definitions 2.8 and 2.10, the differentiable supervised classifier employing differential
learning learns by maximizing the CFM objective function via a search (i.e., a numerical optimization
procedure) on parameter space. Regardless of the search algorithm's specific characteristics (e.g., [106,
ch. 10]), it uses the first derivative of the objective function in order to update the classifier's parameters
iteratively. The magnitude of the parameter change induced by each iteration of the search (that is, the rate
at which the classifier learns) is proportional to the objective function's first derivative (see section 5.5).
The magnitude of this derivative, in turn, is sigmoidally related to the discriminant differential engendered
by the training example. This leads us to define three classes of training examples on the basis of the
discriminant differentials they engender. The following definitions are illustrated in figure D.2.

Figure D.2: Three types of training examples: un-learned examples exhibit negative discriminant differentials;
transition examples exhibit discriminant differentials that correspond to the transition region of the synthetic
CFM sigmoid (therefore, some un-learned examples are also transition examples); learned examples have
positive differentials that correspond to the maximum CFM value of unity.

^Again, this adjustment is made by altering the confidence parameter $\psi \in (0,1]$ of the function. Such an alteration requires the
re-computation of the radii and their squares, the radius centroids, tangent points, and linear coefficients shown in figure D.1. Once
these values are computed via equations (D.10) - (D.29), they need not be re-computed until and unless the confidence parameter is
changed again.

Definition D.1 Un-learned example: This is a training example that exhibits a negative discriminant
differential (i.e., one that the classifier does not classify correctly).

Definition D.2 Learned example: This is a training example that exhibits a positive discriminant
differential (i.e., one that the classifier does classify correctly). The magnitude of the discriminant differential
$\delta$ is large enough that the CFM it elicits is the maximum value of unity ($\sigma[\delta,\psi] = 1$). Thus, the minimum
discriminant differential that a learned example can exhibit depends on the confidence parameter $\psi$ of the
CFM objective function; this minimum value of $\delta$ is $\mu_{xp}$ (see figure D.1).

Definition D.3 Transition example: This is a training example that exhibits a discriminant differential
$\delta$ (either positive or negative) for which $\sigma[\delta,\psi]$ is in the transition region of the sigmoidal function.

Remark: Note that an un-learned example may also be a transition example.
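Definitions D.1 through D.3 can be expressed as a small classification routine. This is a sketch only: the transition-region limits `trans_lo` and `trans_hi` and the threshold `learned_min` (the smallest differential for which CFM reaches unity) are hypothetical parameters standing in for values that depend on the confidence parameter and are read from figure D.1.

```python
def example_types(delta, trans_lo, trans_hi, learned_min):
    """Return the set of labels (definitions D.1 - D.3) that apply to delta."""
    labels = set()
    if delta < 0:
        labels.add("un-learned")   # definition D.1: negative differential
    if delta >= learned_min:
        labels.add("learned")      # definition D.2: CFM has reached unity
    if trans_lo <= delta <= trans_hi:
        labels.add("transition")   # definition D.3: inside the transition region
    return labels
```

Returning a set rather than a single label reflects the remark above: an example with a small negative differential is both un-learned and a transition example.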

If we accept the differential notion of learning for statistical pattern recognition as detailed in chapter 2,
then un-learned and transition training examples are the only ones that concern us. That is, a training example
is either learned (by definition D.2) or it is not, and we are concerned only with those that are not. The
convergence properties of differential learning via CFM follow from an analysis of the rate at which these yet
un-learned examples are learned (i.e., the rate at which they are transformed to learned examples, as defined
above). We wish this learning to proceed at a reasonable rate, so we must avoid unreasonably slow learning.

Definition D.4 Unreasonably slow learning strategy: Since the rate at which a training example is
learned is proportional to $\frac{d}{d\delta}\sigma[\delta,\psi]$, where $\delta$ is the discriminant differential elicited by the example,
transition examples have the highest learning rate. We denote the ratio of $\frac{d}{d\delta}\sigma[\delta,\psi]$ for transition examples
to $\frac{d}{d\delta}\sigma[\delta,\psi]$ for un-learned examples by $\phi(\psi)$. If $\phi(\psi)$ increases exponentially with decreasing $\psi$,
learning becomes dominated by the transition examples for small $\psi$: the classifier's parameters are updated
to transform the transition examples into learned examples, while the un-learned examples are ignored
(because the derivatives they elicit are so small in comparison to those of the transition examples). Under
these circumstances, it takes an unreasonably long time (e.g., [58, pp. 155-158]) to learn the yet un-learned
training examples, and we characterize the (differential) learning strategy as unreasonably slow.

Definition D.5 Reasonably fast learning strategy: A learning strategy that is not unreasonably slow by
definition D.4 is reasonably fast.

D.3.1 The Convergence Properties of Differential Learning via the Original Logistic
Sigmoidal Form of CFM

Figure D.3 shows the original logistic sigmoid functional form used for the CFM objective function [55]:

$$\sigma[\delta,\beta] = \alpha\,\left[1 + \exp(-\beta\delta + \zeta)\right]^{-1} \qquad \text{(D.47)}$$

The linear scaling parameter $\alpha$ is generally taken to be unity, the parameter $\zeta$ sets the horizontal offset
of the sigmoid's transition region, and the parameter $1 \le \beta < \infty$ sets the steepness of the sigmoid's
transition region. Note that $\beta$ in this original functional form is proportional to the inverse of the synthetic
function's confidence parameter:

$$\beta \propto \psi^{-1} \qquad \text{(D.48)}$$

a relationship that validates definitions D.1 - D.5 for the logistic sigmoidal form as well as the synthetic
form of CFM. From (D.47) it is straightforward to prove that the first two derivatives of $\sigma[\delta,\beta]$ with
respect to $\delta$ are given by

$$\frac{d}{d\delta}\sigma[\delta,\beta] = \beta\,\sigma[\delta,\beta]\left(1 - \frac{\sigma[\delta,\beta]}{\alpha}\right) \qquad \text{(D.49)}$$

and

$$\frac{d^2}{d\delta^2}\sigma[\delta,\beta] = \beta^2\,\sigma[\delta,\beta]\left(1 - \frac{\sigma[\delta,\beta]}{\alpha}\right)\left(1 - \frac{2\,\sigma[\delta,\beta]}{\alpha}\right) \qquad \text{(D.50)}$$
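The logistic form and its first two derivatives can be checked numerically. The sketch below assumes the form sigma = alpha / (1 + exp(-beta*delta + zeta)), with derivatives beta*sigma*(1 - sigma/alpha) and beta^2*sigma*(1 - sigma/alpha)*(1 - 2*sigma/alpha); the function names and the finite-difference helper are illustrative and are not taken from the report's source code.

```python
import math


def cfm_logistic(delta, beta, alpha=1.0, zeta=0.0):
    """Logistic CFM: alpha / (1 + exp(-beta*delta + zeta))."""
    return alpha / (1.0 + math.exp(-beta * delta + zeta))


def cfm_logistic_d1(delta, beta, alpha=1.0, zeta=0.0):
    """First derivative with respect to delta: beta*s*(1 - s/alpha)."""
    s = cfm_logistic(delta, beta, alpha, zeta)
    return beta * s * (1.0 - s / alpha)


def cfm_logistic_d2(delta, beta, alpha=1.0, zeta=0.0):
    """Second derivative: beta^2 * s * (1 - s/alpha) * (1 - 2*s/alpha)."""
    s = cfm_logistic(delta, beta, alpha, zeta)
    return beta ** 2 * s * (1.0 - s / alpha) * (1.0 - 2.0 * s / alpha)


def central_difference(f, x, h=1e-5):
    """Numerical derivative used only to sanity-check the closed forms."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

With alpha = 1 and zeta = 0, the first derivative at delta = 0 evaluates to beta/4 and the second derivative vanishes there, which is the behaviour the transition-region analysis relies on.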

Figure D.3: Left: The original logistic sigmoidal form of the CFM objective function for four values of the
steepness parameter $\beta$ (figure adapted from [55]). The differential learning rate decreases exponentially with
increasing $\beta$ for all training examples that generate discriminant differentials ($\delta$s) in the gray shaded region
of the plot (see lemma D.1). This region varies with $\beta$: it comprises all values of $\delta$ to the left of the point
at which the plot for a given value of $\beta$ intersects with the shaded background. Right: The function's first
derivative with respect to the discriminant differential $\delta$ for the same four values of $\beta$.

Recall that a negative discriminant differential $\delta$ indicates a misclassified training example. An objective
function with a non-zero first derivative for negative differentials is therefore essential to reasonably fast
learning. We know from chapter 2 that the CFM objective function must sometimes approximate a step
function in order to guarantee that the classifier approximates the error rate of the Bayesian discriminant
function as closely as possible. The original logistic sigmoidal form of CFM shown in figure D.3 has a very
small first derivative (right side of the figure) for negative discriminant differentials when it approximates
the step function.^ As implied in definition D.4, this prevents the search algorithm at the heart of the learning
strategy from converging in time that is a polynomial function of the steepness parameter $\beta$. That is, the
first derivative of the original logistic sigmoidal form of the CFM objective function decreases in exponential
proportion to increasing $\beta$ for $\delta < 0$. As a direct result, the learning rate of any search algorithm relying
on this first derivative decreases exponentially with increasing $\beta$ for $\delta < 0$.

^Obviously, the logistic form of CFM also has a very small first derivative for positive discriminant differentials when it approximates
the step function. However, positive discriminant differentials correspond to learned training examples when CFM approximates the
step function. We are not particularly concerned with learned examples; rather we are concerned primarily with un-learned examples,
which exhibit negative discriminant differentials. If the objective function has very small derivatives for negative differentials, it will
take unreasonably long to learn the yet un-learned examples. An asymmetric sigmoidal form for CFM exhibits very small derivatives
for relatively large positive differentials, so it effectively ignores training examples that have already been learned. At the same time,
the asymmetric form retains sizable derivatives for all negative differentials, thereby focusing on learning the yet-unlearned examples.

Lemma D.1 The rate of differential learning via the original logistic sigmoidal form of the CFM objective
function generally decreases as $O[\xi^{-\beta}]$ ($\xi > 1$, $\beta \in [1,\infty)$) for un-learned examples.

Proof: We fix the parameters $\alpha = 1$ and $\zeta = 0$ in (D.47) without loss of generality. By (D.47) and
(D.49)

$$\frac{d}{d\delta}\sigma[\delta,\beta] = \frac{\beta}{\exp(\beta\delta)\left(1 + 2\exp(-\beta\delta) + \exp(-2\beta\delta)\right)} = \frac{\beta}{2 + \exp(\beta\delta) + \exp(-\beta\delta)} \qquad \text{(D.51)}$$

If this derivative decreases exponentially with increasing $\beta$, then there must exist some constant $\xi > 1$ for
which

$$\frac{\beta}{2 + \exp(\beta\delta) + \exp(-\beta\delta)} < \xi^{-\beta} \qquad \text{(D.52)}$$

or

$$\frac{\ln\left(2 + \exp(\beta\delta) + \exp(-\beta\delta)\right) - \ln(\beta)}{\beta} > \ln(\xi) > 0 \qquad \text{(D.53)}$$

Since $\ln\left(2 + \exp(\beta\delta) + \exp(-\beta\delta)\right) > \ln(\exp(-\beta\delta))$, (D.53) is satisfied if

$$\delta < -\frac{\ln(\beta)}{\beta} \qquad \text{(D.54)}$$

The bound is tight for large $\beta$, but loose when $|\beta\delta| \approx 1$. Thus, the first derivative of the logistic sigmoidal
form of CFM decreases exponentially for increasing $\beta$ when the discriminant differential is less than the
upper bound given by the right-hand side of (D.54). This bound is plotted in figure D.4. The left side of
figure D.3 also depicts the bound: it is the point at which $\sigma[\delta,\beta]$ intersects the gray shaded background.
This leads us to conclude that the first derivative of the logistic sigmoidal form of CFM generally decreases
exponentially with increasing $\beta$:

$$\frac{d}{d\delta}\sigma[\delta,\beta] = O[\xi^{-\beta}] \quad \forall\, \delta < -\frac{\ln(\beta)}{\beta},\ \xi > 1 \qquad \text{(D.55)}$$

Note that the minimum of $-\ln(\beta)/\beta$ is $-\exp(-1) \approx -.368$, so (D.55) holds for all $\delta < -.368$, regardless
of the value of $\beta$. Since the learning rate for yet un-learned examples is proportional to $\frac{d}{d\delta}\sigma[\delta,\beta]$,
the lemma is proven. ■
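The bound on the discriminant differential used in this proof can be explored numerically. The sketch below assumes the bound has the form delta < -ln(beta)/beta and that, with alpha = 1 and zeta = 0, the logistic slope is beta / (2 + exp(beta*delta) + exp(-beta*delta)); the helper names and the grid of beta values are illustrative only.

```python
import math


def delta_upper_bound(beta):
    """Right-hand side of the bound delta < -ln(beta)/beta (beta >= 1)."""
    return -math.log(beta) / beta


def logistic_slope(delta, beta):
    """Logistic CFM slope with alpha = 1 and zeta = 0."""
    return beta / (2.0 + math.exp(beta * delta) + math.exp(-beta * delta))


# The bound is most negative near beta = e, where it equals -1/e.
betas = [1.0 + 0.01 * k for k in range(2000)]
most_negative = min(delta_upper_bound(b) for b in betas)

# For a differential below -1/e, the slope shrinks rapidly as beta grows.
slopes = [logistic_slope(-0.7, b) for b in (5.0, 10.0, 20.0, 40.0)]
```

Scanning the grid shows the bound never drops below about -0.368, so any differential more negative than that lies in the exponentially decaying regime for every steepness considered.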


Figure D.4: The logistic sigmoidal form of the CFM objective function has a first derivative $\frac{d}{d\delta}\sigma[\delta,\beta]$ (see
figure D.3, right) that decreases exponentially with increasing steepness parameter $\beta$ when the discriminant
differential $\delta$ of the classifier falls below the upper-bound value shown above. Note that the upper bound on
$\delta$ varies with $\beta$; in no case is it less than $-\exp(-1) \approx -.368$. Figure D.3 (left) plots this upper bound value of
$\delta$ in light gray for all $\beta \ge 1$.

Lemma D.2 The rate of differential learning via the original logistic sigmoidal form of the CFM objective
function generally increases as $O[\beta]$ ($\beta \in [1,\infty)$) for transition examples.

Proof: We fix the parameters $\alpha = 1$ and $\zeta = 0$ in (D.47) without loss of generality. By inspection of
(D.50), we solve for the value of $\delta$ that yields a CFM second derivative of zero; this occurs at $\delta = 0$ for
all choices of $\beta$:

$$\frac{d^2}{d\delta^2}\sigma[\delta = 0,\beta] = 0 \quad \forall\, \beta \qquad \text{(D.56)}$$

By (D.51),

$$\frac{d}{d\delta}\sigma[\delta = 0,\beta] = \frac{\beta}{4} = O[\beta] \quad \forall\, \beta \qquad \text{(D.57)}$$

■

Lemmas D.1 and D.2 lead us to the following theorem:

Theorem D.1 Differential learning via the original logistic sigmoidal form of the CFM objective function is
unreasonably slow for $\beta \gg 1$.

Proof: We fix the parameters $\alpha = 1$ and $\zeta = 0$ in (D.47) without loss of generality. It is straightforward
to prove that the derivative decreases with increasing $|\delta|$ on the tails of the sigmoid (i.e., for $|-\beta\delta + \zeta| \gg 1$):
inspection of $\frac{d}{d\delta}\sigma[\delta,\beta]$ in (D.51) reveals that it is positive-definite, and lemma D.2 proves that it is
maximum at $\delta = 0$. The ratio of the derivative in the transition region (lemma D.2) to the derivative in
the lower tail (lemma D.1) gives us a ratio of the learning rate for transition examples (i.e., ones that have
small discriminant differentials) to the learning rate for yet un-learned examples. We refer to this ratio of
derivatives as the learning rate ratio $\phi(\beta)$, which is

$$\phi(\beta) \triangleq \frac{2 + \exp(\delta\beta) + \exp(-\delta\beta)}{4} = O[\exp(|\delta|\beta)], \quad \beta \gg 1,\ \delta < 0 \qquad \text{(D.58)}$$

Thus, by definition D.4, lemmas D.1 and D.2, and (D.58), differential learning via the logistic sigmoidal
form of CFM is unreasonably slow when $\beta \gg 1$. ■
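The exponential growth of the learning rate ratio can be seen by evaluating it directly. The sketch below assumes the ratio takes the form (2 + exp(delta*beta) + exp(-delta*beta)) / 4 with alpha = 1 and zero horizontal offset, so its values need not coincide with curves plotted for a sigmoid that uses a non-zero offset. The function name is hypothetical.

```python
import math


def learning_rate_ratio_logistic(delta, beta):
    """Ratio of the slope at delta = 0 to the slope at delta (alpha = 1, zeta = 0)."""
    x = delta * beta
    return (2.0 + math.exp(x) + math.exp(-x)) / 4.0


# Ratio for an un-learned example with delta = -0.7 at increasing steepness.
ratios = {beta: learning_rate_ratio_logistic(-0.7, beta) for beta in (2, 5, 10, 20, 30)}
```

For large beta the ratio approaches exp(|delta|*beta)/4, which is the exponential dependence the theorem identifies as unreasonably slow learning.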


Remark: Theorem D.1 means that it takes a classifier employing differential learning via a gradient-based
search and the logistic sigmoidal form of CFM an unreasonably long time to learn some training examples.
Un-learnable examples are ones that require a steep CFM sigmoid to be learned (see section 2.4 and
section D.4); the first derivative of the logistic sigmoid for these un-learnable examples (which, by definition,
have a negative discriminant differential $\delta$) is so small that it would take an unreasonably large number of
search iterations to modify the classifier's parameters enough for the example to be correctly classified (i.e.,
learned). The gray curve in figure D.5 shows the learning rate ratio $\phi(\beta)$ for values of $\beta$ from 2 to 30. The
curve assumes a nominal discriminant differential value of $\delta = -.7$ in (D.58) for the un-learned examples.
In practice, values of $\beta$ that exceed 10 result in unreasonably slow learning, owing to the dominance of
the transition example learning rate (note that $\phi(\beta) = 315$ for $\beta = 10.5$). Numerical “tricks” such
as increasing the step size of the learning algorithm to increase the learning rate for yet un-learned training
examples do not compensate for the dominance of transition examples. In practice, they lead to unstable
oscillations in the search algorithm (in [55] it was found that $\beta$ had to be less than about 10 to prevent
unstable learning). The net result is that $\beta$ must be kept small to 1) prevent unreasonably slow learning of
yet-unlearned training examples, and 2) prevent oscillations in the learning algorithm. Small values of $\beta$
prevent (2.96) from being satisfied for all points on $\mathcal{X}$: some training examples are un-learnable and the
resulting classifier does not achieve as low an error rate as it might. This combination of deficiencies led us
to develop the synthetic form of CFM.


D.3.2 The Convergence Properties of Differential Learning via the Synthetic Form
of CFM

Differential learning via the synthetic form of the CFM objective function remains reasonably fast and free
of unstable oscillations, even when the transition region of the sigmoid is quite steep.

Figure D.5: The ratio $\phi(\cdot)$ of the differential learning rate for transition examples to that for un-learned
examples with a nominal discriminant differential value of $\delta = -.7$. The ratio is plotted over the range of
the confidence parameter that regulates the steepness of the CFM objective function's sigmoidal form. Ratios
for the original logistic sigmoidal form ($\phi(\beta)$, gray) and the synthetic form ($\phi(\psi)$, black) are shown. Note
that the $\beta$ scale has been warped to match the $\psi$ scale: values of $\beta$ are shown along the gray curve. Light
gray shading under the curves indicates the range of $\beta$ and $\psi$ values for which learning is reasonably fast
and stable. The black curve shows that the synthetic form of CFM remains reasonably fast for un-learned
examples as its transition region becomes steep ($\psi \approx .15$); in contrast, the gray curve shows that the
logistic sigmoidal form of CFM becomes unreasonably slow for un-learned examples while its transition
region is still relatively shallow ($\beta = 10.5$). Figure D.8 (top) shows both forms of CFM for $\beta = 10.5$
(logistic sigmoidal) and the equivalent $\psi = .49$ (synthetic).

Lemma D.3 The rate of differential learning via the synthetic form of the CFM objective function generally decreases
as $O[\psi]$ ($\psi \in (0,1]$) for un-learned training examples.

Proof: It can be shown that the first derivative of the synthetic CFM objective function described above
is always greater in the lower radius and the transition region than it is in the lower leg, regardless of the
value of $\psi \in (0,1]$. Figure D.1 makes this plain, so we forego the rigorous proof. Given this relationship, the
first derivative of the synthetic CFM objective function is bounded from below for all negative discriminant
differentials (i.e., for all un-learned examples):

^<7  [5. 1,/']  >  o;  <  0  (D.59) 


By  (D.18)-(D.29), 
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a;  =  tan(Z,) 

=  - """  (£)) 

Since  //,„  =  V’7?„  sin(Z4)  and  /»„  =  005(^4), 

lim  tan“' ( — .  )  =  lan“' (V’T^n  sin(/^4))  =  sin(Z4)  (D.61) 

\/'r«  +  */ 

Since  lim  L„  =  1  (this  is  readily  verifiable  in  figure  D.  I ,  so  again  we  forego  the  rigorous  prooO, 

lim  sin"’  (-pj  ^  sin"' S'  (,’/?„  (D.62) 

^''_*0+  \^n/ 

By  (D.59)  -  (D.62). 

lim  [<5 , 0]  =  lim  o*  S'  tan  (t/’(7?„  sin(Z4)  -  Af„)) 

l/''-40+ 

Si  t!>  {n„  -  R„) 

=  k  a  constant) 

=  0[ti']  V()  <  0  (D.63) 

Thus,  the  first  derivative  of  the  synthetic  CFM  objective  function  is  O  [V]  for  all  negative  discriminant 
differentials  when  V'  is  small.  In  fact,  the  relationship  holds  approximately  for  all  values  of  v>  ■  Figure  D.6 
shows  that  the  slope  of  the  synthetic  CFM  objective  function’s  lower  leg  is(?[(/’"^]  —  or  approximately 
linear  with  respect  to  ij’  — for  all  tj’  €  (0,1].  Since  the  learning  rate  for  un-leamed  examples  is 
proportional  to  a* ,  the  theorem  is  proven.  I 

Lemma D.4 The rate of differential learning via the synthetic form of the CFM objective function generally
increases as $O[\psi^{-1}]$ ($\psi \in (0,1]$) for transition examples.

Proof: By our specification of the synthetic CFM objective function in section D.1, it always attains its
maximum derivative of $a_t$ in the transition region (see figure D.1).

Figure  D.6;  The  slope  a*  (black)  of  the  synthetic 
CFM  objective  function’s  lower  leg,  as  a  function  of 
the  confidence  parameter  tj' .  Note  that  a*  is  O  [t/’J 
for  0  <  0. 1 5  (see  the  proof  of  theorem  D.3),  and 
O  V' for  0.15  <  V  <  !•  The  linear  and 

O  V'*  asymptotes  are  shown  in  light  gray. 


Figure D.7: The slope $a_t$ (black) of the synthetic CFM objective function's transition region, as a
function of the confidence parameter $\psi$. Note that $a_t$ is $O[\psi^{-1}]$ for all $\psi$ (see the proof of
lemma D.4), as indicated by the asymptote (which is a linear function of $\psi^{-1}$) shown in light gray.
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tan  f tan  ' 

V  \xi'(\  +  n„cQ%(u)))  ) 

_ 1 _ 

t}'(\  +  n„  cosiu)) 

o[v'~']  (D.67) 

(0,1] ,  as  illustrated  by  figure  D.7.  Since  the  learning 
rate  for  transition  examples  is  proportional  to  a* ,  the  theorem  is  proven.  I 

Theorem D.2 Differential learning via the synthetic form of the CFM objective function is reasonably fast.

Proof: The ratio of the derivative in the transition region (lemma D.4) to the derivative in the lower tail
(lemma D.3) gives us the learning rate ratio (i.e., the ratio of the learning rate for transition examples to the
learning rate for yet un-learned examples):


lim  max  4iG  [6 ,  V']  =  lim  a*  = 


In  fact,  the  relationship  holds  well  for  all  values  of  ij'  € 


d 


a 


It 

n 


By  (D,67)  and  (D.63) 


(D.68) 


lim  <^((/-)  =  [V’Ml  +  7^"Cos(Z4))  (7ZnSin(Z4)  -  «,)1 

=  O  (D.69) 

In fact, the learning rate ratio $\phi(\psi)$ remains $O[\psi^{-2}]$ for all $\psi \in (0,1]$, so differential learning via
synthetic CFM is reasonably fast. ■

Figure D.8: Equivalent logistic (dashed gray) and synthetic (solid black) CFM functional forms. Note
that the slope and shape of both functions are approximately the same in the transition region and upper
leg. In the lower leg the synthetic function's slope is orders of magnitude larger than its logistic sigmoidal
counterpart's. (top): The logistic sigmoidal form has a horizontal offset value of $\zeta = .4\beta$ (see (D.47)).
Given this level of CFM steepness, the learning rate ratio is 11 for the synthetic function and 315 for its
logistic sigmoidal counterpart: differential learning via the logistic sigmoidal form is unreasonably slow for
un-learned examples. (bottom): The logistic sigmoidal form has a horizontal offset value of $\zeta = .05\beta$ (see
(D.47)). Given this level of CFM steepness, the learning rate ratio is 315 for the synthetic function and $10^{19}$
for its logistic sigmoidal counterpart. Differential learning via the synthetic form of CFM remains reasonably
fast and tenable for un-learned examples as long as its sigmoid is no steeper than this ($\psi \ge .15$).

Remark: From a practical viewpoint, differential learning via both forms of CFM becomes slow when the
learning rate ratio exceeds about 315. Figure D.5 illustrates that for $\psi > .15$ the synthetic CFM learning
rate ratio is low enough to ensure reasonably fast learning. A comparison of the $\phi(\psi)$ curve with the
curve for the logistic sigmoidal form of CFM emphasizes that the first derivative of the steep synthetic CFM
objective function is orders of magnitude larger than that of the comparable logistic sigmoidal form; as a
result, the learning rate ratio for the synthetic form of CFM is orders of magnitude smaller than that for the
comparable logistic sigmoidal form. Figure D.8 compares the logistic sigmoidal and synthetic forms of CFM
for two cases. The top figure shows the two forms for the value of $\beta = 10.5$ at which differential learning
via the logistic sigmoidal form of CFM becomes slow. For this value of $\beta$, $\phi(\beta) \approx 315$; the comparable
synthetic form of CFM has a confidence parameter of $\psi = .49$, for which $\phi(\psi) \approx 11$, so un-learned
examples are learned approximately 30 times faster than they are using the logistic sigmoidal form of CFM.
The bottom figure shows the two forms for the value of $\psi = .15$ at which differential learning via the
synthetic form of CFM becomes slow. For this value of $\psi$, $\phi(\psi) \approx 315$; the logistic sigmoidal form of
CFM has a confidence parameter of $\beta = 65$, for which $\phi(\beta) \approx 10^{19}$. The learning rates of the two
functional forms differ by 17 orders of magnitude for un-learned examples, given this level of steepness in
the sigmoidal function.
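The two comparisons quoted in this remark follow from simple arithmetic on the stated learning rate ratios. The short check below uses only the rounded values given in the text (315, 11, and 10^19); the variable names are illustrative.

```python
import math

# Learning rate ratios quoted in the text for the two comparisons.
ratio_logistic_top, ratio_synthetic_top = 315.0, 11.0
ratio_logistic_bottom, ratio_synthetic_bottom = 1e19, 315.0

# Top case: how much faster un-learned examples are learned with synthetic CFM.
speedup_top = ratio_logistic_top / ratio_synthetic_top  # about 28.6

# Bottom case: gap between the two forms, in orders of magnitude.
orders_bottom = math.log10(ratio_logistic_bottom / ratio_synthetic_bottom)  # about 16.5
```

Both results are consistent with the approximate figures of 30 times and 17 orders of magnitude, given that the quoted ratios are themselves rounded.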


D.4  A  Proof  Relating  to  Synthetic  CFM  and  Chapter  2 

Recall from section 2.4 that for a given input $X$ the classifier's discriminator generates $C$ output activations
$g_1(X|\theta), \ldots, g_C(X|\theta)$ and $C$ corresponding discriminant differentials $\delta_1(X|\theta), \ldots, \delta_C(X|\theta)$. Since
different training examples of $X$ are assigned different empirical class labels according to the a posteriori
probabilities $P_{W|X}(\omega_1|X), \ldots, P_{W|X}(\omega_C|X)$, the expected value of the CFM objective function for $X$
is maximized when, by (2.94),


t^(X)-  =  <7[«5.(X|(>‘),0)  -  fT[0,V’] 

^d(X)-  =  a  [-5.(XIR‘),V’]  -  cr  [0,1-]  ’ 

\ 

U  (i?(X)*  =  -inx)'  =  0) 

jiCFM(Xlff’l  =  0  j 

C(2(i)  =  U).,  s.t.  (J(i){X(d)  =  <5.(X|0') 


(  0(xr  ^  I  -  Pvv|x(a^.|X)\ 
V— t7(X)-  -  P,v|x(u;.lX)  ) 


(D.70) 


where $\omega_*$ denotes the class with the largest a posteriori probability, and $\delta_*(X|\theta^*)$ is the corresponding
discriminant differential. Recall that $\delta_{(1)}(X|\theta)$ is the discriminant differential associated with the classifier's
largest output; likewise, $\omega_{(1)}$ is the class associated with the classifier's largest output. CFM is maximized
when the discriminator's largest output corresponds to the most likely class of $X$: equivalently, CFM is
maximized when the discriminant differential associated with the most likely class is positive (i.e., when
$\delta_*(X|\theta^*) > 0$). Therefore, we simply wish to determine the upper bound on $\psi$, below which this condition
is satisfied via (D.70). The following equations lead to such a bound on $\psi$. Although we would like to
motivate them with a concise intuitive explanation, this proves rather difficult. We advise the reader to rely
heavily on figure D.1 (taking $\delta$ in the figure to mean $\delta_*(X|\theta^*)$ in the present context); refer to (D.10) -
(D.46) when the figure fails to resolve the question.

If  we  assume  that  (5.(X|®")  =  =  xj'  (i.e.,  it  is  the  smallest  value  of  ^.(X|®)  for 

which  (7  [($.(X|9), V']  takes  on  its  maximum  value  of  unity  — see  figure  D.l  and  (D.26)),  then 

fT  [^.(X|^'),  V’]  =  I  and  in  (D.70) simplifies  to 


d(X)‘  ^  I  -  a 

—>i)(xy  <j  [o,i/.]  -  (7  [-V'.V’] 

Since  (7  [O,  0]  <  a*  +  r„  =  a*  +  J  ij'  and  <7  [-0.  V']  >  (1  “  0)  '* 


(D.7I) 


>^(X)-  ^  1  -  <  -  .7  0  ^  1  -  (*  +  -7)  0 

— .i)(X)*  -  .7v  +  a;V’  (kij’  +  .7)il< 

where  k  =  sin(Z4)  -  S  .3 .  Therefore,  is  bounded  from  below: 

«:>(X)-  I  -  V’  ^  I  -  0 

-  .3  02  ^  7  0  0(.3  0  +  .7) 


(D.72) 


(D.73) 


This  lower  bound  is  tight  for  small  0  •  It  is  loose  for  t/’  -t  I ,  since  Iim0_^i  ~  I,  whereas  the 

lower  bound  yields 


I  -  xl< 

lim  - 5 — ^ -  =  0 

t'--n  .3t/f  +  .7t/' 


For  smaller  values  of  (/  '  the  bound  can  be  simplified  to 


(D.74) 


-^0(X)‘  - 


•7V’ 


=  1.43  t/ 


-I 


so  that  (D.70)  is  satisfied  if 


or 


),-i  ^  I  -  P>v|x(<^-  |X) 
-  l.43Pw|x(u;.lX) 


t/’  < 


Pvv|x(U2.|X) 

■  1  -  Pvv|x(U2.|X) 


(D.75) 


(D.76) 


(D.77) 


Thus,  the  requirement  for  learning  the  most  probable  class  of  X ,  stated  in  (2.94),  is  satisfied  when  (D.77)  is 
satisfied. 

^These bounds on the value of synthetic CFM are readily verified by a visual inspection of figure D.1.

D.5 Modifying Backpropagation for use with CFM

Any differentiable supervised classifier can use the CFM objective function to learn differentially. Since
neural network classifiers employing the backpropagation learning procedure [119, 120] are a popular family
of differentiable supervised classifiers, we show how to modify the backpropagation algorithm for use with
CFM.

There are two fundamental differences between backpropagation with CFM and backpropagation with
error measures such as MSE:

•  For any given training example representing one of the C possible classes, CFM is a function of only
two discriminator outputs; error measures are functions of all C discriminator outputs.

•  CFM is maximized, whereas error measures are minimized.

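Gradient computations: Recall from section 2.4 that the CFM generated by a training sample $S^n$ of n
examples is given by

$$\mathrm{CFM}(S^n|\theta) = \frac{1}{n}\sum_{j=1}^{n}\left(\sigma[\delta_\tau(X^j|\theta),\psi] \;:\; W^j = \omega_\tau\right), \tag{D.78}$$

where $X^j$ and $W^j$ denote the jth of n training examples and its associated class label. The discriminant
differential $\delta_\tau(X^j|\theta)$ generated by the example $X^j$ having the class label $W^j = \omega_\tau$ ($\tau \in \{1, \ldots, C\}$)
is

$$\delta_\tau(X^j|\theta) = g_\tau(X^j|\theta) - \max_{k \ne \tau} g_k(X^j|\theta). \tag{D.79}$$

Thus, the derivative of $\sigma[\delta_\tau(X^j|\theta),\psi]$ is non-zero with respect to only two outputs, $y_\tau$ and $y_{\bar\tau}$
(the largest other output):*

$$\frac{\partial}{\partial y_k}\,\sigma[\delta_\tau,\psi] =
\begin{cases}
\dfrac{\partial}{\partial \delta_\tau}\,\sigma[\delta_\tau,\psi], & y_k = y_\tau \\[2ex]
-\dfrac{\partial}{\partial \delta_\tau}\,\sigma[\delta_\tau,\psi], & y_k = y_{\bar\tau} \\[2ex]
0, & \text{otherwise}
\end{cases} \tag{D.80}$$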

Figure D.9 illustrates the significance of (D.80) for a hypothetical classifier that learns via backpropagation
modified for use with CFM. The classifier has five discriminant functions, corresponding to the five classes
that the feature vector can represent. The classifier's parameters are shown as black arrows pointing towards
the discriminator's outputs, and the states of the classifier's nodes, given the example $X^j$, are depicted

* The notation $\delta_\tau$ is short-hand for $\delta_\tau(X^j|\theta)$ throughout this section.

Figure D.9: A diagrammatic view of backpropagation with the CFM objective function. The classifier has
C = 5 output nodes, which correspond to its five discriminant functions. The classifier's input is the
N-dimensional feature vector, and the classifier has one hidden layer containing three nodes. The parameters
(i.e., connections or weights) of the classifier are depicted by black arrows. The figure depicts the state of the
classifier, given a particular training example (darker nodes have larger values than lighter ones). The
classifier's CFM $\sigma[\delta_\tau, \psi]$ is a function of the discriminant differential $\delta_\tau$, which is a function of only
two outputs: the output $y_\tau = y_4$ corresponding to the input example's class label (which is $\omega_4$ in this
case), and the largest other output $y_{\bar\tau} = y_2$. Thus, the derivative of $\sigma[\delta_\tau, \psi]$ is non-zero with respect to
outputs $y_\tau$ and $y_{\bar\tau}$ only. The gray arrows pointing back through the classifier towards its input denote all
the resulting non-zero derivatives of $\nabla_\theta(\sigma[\delta_\tau, \psi])$, the gradient of CFM with respect to the classifier's
parameters, given the single training example/class label pair $(X^j, W^j)$.


in grayscale (darker nodes have larger values than lighter ones). Since $X^j$ is an example of class $\omega_4$,
$y_\tau = y_4$. Likewise, since $y_2$ is the most active of all other discriminator outputs, $y_{\bar\tau} = y_2$. The discriminant
differential $\delta_\tau$ is therefore $y_4 - y_2$. Note that $y_{\bar\tau} > y_\tau$, so $\delta_\tau$ is negative and $X^j$ is an un-learned
example (definition D.1; i.e., the classifier has not yet learned to classify $X^j$ correctly).

We denote the gradient of $\sigma[\delta_\tau(X^j|\theta),\psi]$ with respect to the classifier's parameters $\theta$ by
$\nabla_\theta(\sigma[\delta_\tau(X^j|\theta),\psi])$. Since only the derivatives $\frac{\partial}{\partial y_2}\sigma[\delta_\tau(X^j|\theta),\psi]$ and $\frac{\partial}{\partial y_4}\sigma[\delta_\tau(X^j|\theta),\psi]$
are non-zero, only the parameters associated with the second and fourth discriminant functions affect the
value of CFM for this example. Indeed, only those elements of $\nabla_\theta(\sigma[\delta_\tau(X^j|\theta),\psi])$ corresponding to
the parameters of the second and fourth discriminant functions are non-zero and need be computed. These
gradient computations are depicted by the thick gray arrows of figure D.9, which point back towards the
classifier's input. Note that once $X^j$ becomes a learned example (definition D.2), $\delta_\tau(X^j|\theta)$ exceeds $\psi$
and the derivative of CFM with respect to all outputs is zero. Mathematically,

$$\frac{\partial}{\partial y_k}\,\sigma[\delta_\tau(X^j|\theta),\psi] = 0 \;\;\forall k \quad \text{iff} \quad \delta_\tau(X^j|\theta) > \psi \tag{D.81}$$

When this is the case, no backpropagation computations have to be performed for $X^j$. This characteristic of
differential learning via synthetic CFM results in substantial computational savings as an increasingly large
fraction of the training sample is learned (see section 7.5).
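The sparsity described by (D.79) through (D.81) can be sketched in a few lines of C. This is only an illustration under stated assumptions, not the thesis implementation: the names `cfm_output_derivs` and `toy_dsigma` are hypothetical, the derivative of the CFM sigmoid is passed in as a function pointer (a routine such as `d_cfm()` from section D.6 could be supplied), and the learned-example test uses `>=` to match the saturation test in that listing.

```c
#include <stddef.h>

/* Toy stand-in for the sigmoid derivative (an assumption for illustration):
 * a constant slope that vanishes once delta reaches the confidence value. */
double toy_dsigma(double delta, double conf)
{
    return (delta < conf) ? 0.5 / conf : 0.0;
}

/* Fills dy[k] with the derivative of sigma[delta_tau, psi] with respect to
 * each output y[k]. Returns 1 if backpropagation is needed for this example,
 * 0 if every derivative is zero (the example is already learned).
 * Assumes C >= 2 and 0 <= tau < C. */
int cfm_output_derivs(const double *y, size_t C, size_t tau, double psi,
                      double (*dsigma)(double, double), double *dy)
{
    size_t k;
    size_t other = (tau == 0) ? 1 : 0;     /* any index different from tau */
    double delta, slope;

    for (k = 0; k < C; k++) {
        dy[k] = 0.0;                       /* (D.80): zero unless set below */
        if (k != tau && y[k] > y[other])
            other = k;                     /* largest output other than tau */
    }

    delta = y[tau] - y[other];             /* (D.79) */
    if (delta >= psi)
        return 0;                          /* (D.81): nothing to backpropagate */

    slope = dsigma(delta, psi);
    dy[tau]   =  slope;                    /* d(delta)/d(y_tau)   = +1 */
    dy[other] = -slope;                    /* d(delta)/d(y_other) = -1 */
    return 1;
}
```

For the situation of figure D.9 (five outputs, correct class 4, largest other output 2) only `dy[3]` and `dy[1]` come back non-zero, so only the weights feeding those two outputs need gradient computations.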

Steepest ascent search: Because CFM is maximized, we use a steepest ascent search for the optimal
classifier parameters:

$$\theta[k+1] = \theta[k] + \underbrace{\epsilon \cdot \frac{\nabla_\theta\left(\mathrm{CFM}(S^n|\theta[k])\right)}{\left\|\nabla_\theta\left(\mathrm{CFM}(S^n|\theta[k])\right)\right\|} + \alpha \cdot \Delta\theta[k-1]}_{\Delta\theta[k]} \tag{D.82}$$

where k is an iteration index,

$$\nabla_\theta\left(\mathrm{CFM}(S^n|\theta[k])\right) = \frac{1}{n}\sum_{j=1}^{n}\nabla_\theta\left(\sigma[\delta_\tau(X^j|\theta[k]),\psi] \;:\; W^j = \omega_\tau\right), \tag{D.83}$$

$\|\nabla_\theta(\mathrm{CFM}(S^n|\theta[k]))\|$ denotes the magnitude of the gradient $\nabla_\theta(\mathrm{CFM}(S^n|\theta[k]))$, and $\alpha \cdot \Delta\theta[k-1]$
is the "momentum" term described in [119, eq. (9)]. Note that the steepest ascent algorithm of (D.82) differs
from the conventional error measure-based steepest descent form of backpropagation in two ways:

•  The sign of the step-size parameter $\epsilon$ is positive for steepest ascent, but negative for steepest descent.
Again, this is because CFM is maximized, whereas error measures are minimized.

•  When $\alpha = 0$ in (D.82), the search step size $\Delta\theta[k]$ has a fixed magnitude of $\epsilon$, since the equation
employs a normalized gradient term. This feature is essential to stable convergence of the search
because $\mathrm{CFM}(S^n|\theta[k])$, the CFM generated by the training sample at iteration k, can have a large
gradient and a large Hessian on parameter space in the vicinity of its maximum. This occurs when
the CFM confidence parameter $\psi$ is small (i.e., when the synthetic CFM sigmoid is steep). If a
non-normalized gradient were used in (D.82), $\|\Delta\theta[k]\|$ could be very large, inducing a large search
step precisely where the step should be small (i.e., where $\mathrm{CFM}(S^n|\theta[k])$ has high curvature). See
section 5.3.6 for a simple, hypothetical scenario in which CFM has a large gradient and a large Hessian
on parameter space in the vicinity of its maximum.

D.6 Source Code for the Synthetic CFM Objective Function

The following ANSI C source code implements the synthetic CFM function $\sigma[\delta, \psi]$ described above, along
with its first and second derivatives. The source code argument delta represents $\delta$, and the argument
conf represents $\psi$.


/*==========================================================================

NOTICE OF COPYRIGHT:  Copyright 1992 by John Benjamin Hampshire II.

Individuals may compile, copy, distribute, and reuse this source code
with one restriction: this notice of copyright may NOT be removed.
The copyright holder disclaims any warranty of any kind,
expressed or implied, as to this code's fitness for any specific use.

Author:   J. B. Hampshire II

Date:     3-14-92

Purpose:  Computes a synthetic asymmetric sigmoidal function cfm(delta, conf);
          -1 <= delta <= 1.

          This synthetic function is used as the classification figure of merit (CFM)
          in its "N-monotonic" form, described in J. B. Hampshire II's Ph.D. thesis
          of 1993 and Hampshire & Waibel, IEEE Trans. Neural Networks, June, 1990. The
          "discriminant differential" delta is the difference between the
          classifier output representing the correct class and the largest other output.

          In the case that the classifier is a single-output one (2-class case), it is
          necessary to express delta as a function of the single classifier
          output. This isn't hard, but it requires a little care.

          First and second derivatives of cfm(delta, conf) are also computed.

          The function has one "confidence" parameter (the variable "conf" in the
          following code), in addition to its single argument.

          The confidence parameter is on (0,1]; low confidence corresponds to a steep
          sigmoid (approaching a step function), whereas high confidence corresponds to
          a nearly linear function of delta. Each call to cfm(), d_cfm(), and dd_cfm()
          checks to see if the confidence parameter has changed since the last call.
          If it has, cfmSetup() is called, and cfm() and its derivatives are
          synthesized for the new confidence. Following the re-synthesis,
          cfm(delta, conf) or one of its derivatives is computed.

          cfmSetup() is computationally expensive, but cfm(), d_cfm(), and dd_cfm()
          are computationally cheaper than transcendental functions. Since the confidence
          parameter is changed relatively infrequently, this synthetic function is
          on average very cheap to evaluate. The advantages of the synthetic form
          over closed-form functions described in the original CFM paper are described
          in detail in the thesis.

Refs:     JBH2 PhD thesis notes of 901114, 920114, and 920728.

Notes &
Latest
Revision: 3-14-92 by JBH2. Although cfm(delta, conf) is, strictly speaking, defined only
          on [-1,1] (corresponding to classifiers with outputs bounded on [0,1]), the code
          is written in such a way that it works (in practice) for any classifier with
          outputs on the real number line. Since the theoretical proofs pertaining to the
          optimality of CFM are restricted to a rather specifically bounded sigmoid, there
          are no explicit guarantees if you violate the corresponding constraints on the
          classifier outputs.

          I've had no trouble with polynomial classifiers (as an example), but I can't be
          sure that there isn't some failure mode when delta is outside [-1,1].

          7-28-92 by JBH2. Added second derivative function for use with modified

==========================================================================*/

#include <math.h>
#include <stdlib.h>
#include <stdio.h>

#define PI        M_PI          /* 3.14159265..., defined in /usr/include/math.h */
#define TWO_PI    (2.0 * PI)
#define HALF_PI   M_PI_2        /* PI/2..., defined in /usr/include/math.h */
#ifdef INFINITY
#undef INFINITY                 /* C99 and later <math.h> already define INFINITY */
#endif
#define INFINITY  1.0e25
#define TRUE      1
#define FALSE     0
#define RN        0.7
#define A2        0.5
#define INV_A1    0.2
#define A0        (-0.5)
#define RP        (-1.0 * A0)

typedef struct _MyPoint {
    double x, y;
} MyPoint;

static double last_conf;
static double an, rn, xTnn, xTn, inv_ap, bp, xTp, yTp, rp;
static MyPoint Un, Up;

/**************************************************************************
 * void getCfmBreakpoints(): Returns the values of delta marking the lower and upper
 *                           boundaries of the synthetic CFM function's upper radius.
 *
 * Parameters:  l   lower bound of the synthetic CFM function's upper radius passed to
 *                  calling routine via this pointer.
 *              u   upper bound of the synthetic CFM function's upper radius passed to
 *                  calling routine via this pointer.
 *
 * Returns:     nothing
 *
 * Notes &
 * Latest
 * Revision:    1-20-93 by JBH2. Added. There's no point backprop'ing on deltas that exceed
 *              the upper radius, since the synthetic function has zero slope beyond this
 *              point. This function gets called every time the confidence parameter gets
 *              changed, and it updates the bounds in the calling routine.
 **************************************************************************/
void getCfmBreakpoints(l, u)
double *l, *u;
{
    *l = xTp;
    *u = Up.x;
    return;
}

/**************************************************************************
 * double d_cfm():  Returns the synthetic CFM function's first derivative wrt delta, given
 *                  delta and conf.
 *
 * Parameters:  delta   the classifier's output differential.
 *              conf    the CFM confidence parameter.
 *
 * Returns:     the synthetic CFM function's first derivative wrt delta.
 *
 * Notes &
 * Latest
 * Revision:    7-28-92 by JBH2.
 **************************************************************************/
double d_cfm(delta, conf)
double delta, conf;
{
    double d_cfm, diff;
    void cfmSetup();

    /* 1. don't allow confidence to go below .01 */
    if(conf < .01)
        conf = .01;

    /* 2. if the present confidence isn't the same as the last one,
          set up the synthetic function anew. */
    if(conf != last_conf)
        cfmSetup(conf, &Un, &an, &rn, &xTnn, &xTn, &inv_ap, &bp, &xTp, &yTp, &Up, &rp);

    /* 3. compute the value of the objective function's first derivative */
    if(delta >= Up.x)
        d_cfm = 0.0;
    else if(delta > xTp) {
        diff = delta - Up.x;
        d_cfm = -diff / sqrt(rp*rp - diff*diff);
    }
    else if(delta >= xTn)
        d_cfm = 1.0/inv_ap;
    else if(delta > xTnn) {
        diff = delta - Un.x;
        d_cfm = diff / sqrt(rn*rn - diff*diff);
    }
    else
        d_cfm = an;

    /* 4. stash the current confidence value, and return d_cfm */
    last_conf = conf;
    return(d_cfm);
}

/**************************************************************************
 * double dd_cfm(): Returns the synthetic CFM function's second derivative wrt delta, given
 *                  delta and conf.
 *
 * Parameters:  delta   the classifier's output differential.
 *              conf    the CFM confidence parameter.
 *
 * Returns:     the synthetic CFM function's second derivative wrt delta.
 *
 * Notes &
 * Latest
 * Revision:    7-28-92 by JBH2.
 **************************************************************************/
double dd_cfm(delta, conf)
double delta, conf;
{
    double dd_cfm, diff, dtemp;
    void cfmSetup();

    /* 1. don't allow confidence to go below .01 */
    if(conf < .01)
        conf = .01;

    /* 2. if the present confidence isn't the same as the last one,
          set up the synthetic function anew. */
    if(conf != last_conf)
        cfmSetup(conf, &Un, &an, &rn, &xTnn, &xTn, &inv_ap, &bp, &xTp, &yTp, &Up, &rp);

    /* 3. compute the value of the objective function's second derivative */
    if(delta >= Up.x)
        dd_cfm = 0.0;
    else if(delta > xTp) {
        diff = delta - Up.x;
        dtemp = 1.0 / sqrt(rp*rp - diff*diff);
        dd_cfm = (-diff * diff * dtemp * dtemp - 1.0) * dtemp;
    }
    else if(delta >= xTn)
        dd_cfm = 0.0;
    else if(delta > xTnn) {
        diff = delta - Un.x;
        dtemp = 1.0 / sqrt(rn*rn - diff*diff);
        dd_cfm = (diff * diff * dtemp * dtemp + 1.0) * dtemp;
    }
    else
        dd_cfm = 0.0;

    /* 4. stash the current confidence value, and return dd_cfm */
    last_conf = conf;
    return(dd_cfm);
}

/**************************************************************************
 * double cfm():    Returns the synthetic CFM function's value, given delta and conf.
 *
 * Parameters:  delta   the classifier's output differential.
 *              conf    the CFM confidence parameter.
 *
 * Returns:     the synthetic CFM function's value.
 *
 * Notes &
 * Latest
 * Revision:    3-14-92 by JBH2.
 **************************************************************************/
double cfm(delta, conf)
double delta, conf;
{
    double cfm, diff;
    void cfmSetup();

    /* 1. don't allow confidence to go below .01 */
    if(conf < .01)
        conf = .01;

    /* 2. if the present confidence isn't the same as the last one,
          set up the synthetic function anew. */
    if(conf != last_conf)
        cfmSetup(conf, &Un, &an, &rn, &xTnn, &xTn, &inv_ap, &bp, &xTp, &yTp, &Up, &rp);

    /* 3. compute the value of the objective function */
    if(delta >= Up.x)
        cfm = 1.0;
    else if(delta > xTp) {
        diff = delta - Up.x;
        cfm = Up.y + sqrt(rp*rp - diff*diff);
    }
    else if(delta >= xTn)
        cfm = delta/inv_ap + bp;
    else if(delta > xTnn) {
        diff = delta - Un.x;
        cfm = Un.y - sqrt(rn*rn - diff*diff);
    }
    else
        cfm = an * delta + an;

    /* 4. stash the current confidence value, and return cfm */
    last_conf = conf;
    return(cfm);
}


/**************************************************************************
 * void cfmSetup(): Recomputes all the necessary metrics for the synthetic CFM function,
 *                  given the new confidence parameter conf.
 *
 * Parameters:  conf    the new CFM confidence parameter.
 *              Un      the centroid of the function's lower radius.
 *              an      the slope of the function's lower leg.
 *              rn      the function's lower radius.
 *              xTnn    the value of delta at which the function's lower leg and lower
 *                      radius are tangent.
 *              xTn     the value of delta at which the function's lower radius and
 *                      (middle) transition leg are tangent.
 *              inv_ap  the inverse of the transition leg's slope.
 *              bp      the value of CFM at which the transition leg intercepts the
 *                      vertical line delta = 0.
 *              xTp     the value of delta at which the function's (middle) transition
 *                      leg and upper radius are tangent.
 *              yTp     the value of CFM at delta = xTp.
 *              Up      the centroid of the function's upper radius.
 *              rp      the function's upper radius.
 *
 * Returns:     nothing
 *
 * Notes &
 * Latest
 * Revision:    3-14-92 by JBH2.
 **************************************************************************/
void cfmSetup(conf, Un, an, rn, xTnn, xTn, inv_ap, bp, xTp, yTp, Up, rp)
double conf;
double *an, *rn, *xTnn, *xTn, *inv_ap, *bp, *xTp, *yTp, *rp;
MyPoint *Un, *Up;
{
    static double fconf, angle, RR, angle_1, angle_2, angle_3, angle_p;
    static double zeta_n, l, D, a1, x0, y0, x1, y1, arg;
    static MyPoint Tnn, Tn, U0;
    static int virgin=TRUE;

    /* phase 1 and 2 computations are all constants, so do them only once. */
    if(virgin) {
        /*************
         *  PHASE 1  *
         *************/
        angle_2 = atan(A2);
        if(INV_A1 == 0.0)
            angle_1 = HALF_PI;
        else {
            a1 = 1.0 / INV_A1;
            angle_1 = atan(a1);
        }

        /* Notes of 920314, (1) */
        angle_3 = HALF_PI - (angle_1 - angle_2) / 2.0;

        /* Notes of 920314, (2) */
        l = RN / tan(angle_3);

        /* Notes of 920314, (3) */
        if(INV_A1 == 0.0) {
            x0 = 0.0;
            y0 = A2;
        }
        else {
            x0 = A2 / (a1 - A2);
            y0 = a1 * x0;
        }

        /* Notes of 920314, (6) */
        arg = HALF_PI - angle_1;
        x1 = x0 + l * sin(arg);
        y1 = y0 + l * cos(arg);

        /* Notes of 920314, (7) */
        U0.x = x1 - RN * cos(arg);
        U0.y = y1 + RN * sin(arg);

        /*************
         *  PHASE 2  *
         *************/
        /* Notes of 920314, (8) */
        RR = sqrt(U0.x*U0.x + U0.y*U0.y);

        /* Notes of 920314, (8a) */
        angle = atan(fabs(U0.y/U0.x));
        virgin = FALSE;
    }

    /*************
     *  PHASE 3  *
     *************/
    fconf = conf;

    /* Notes of 920314, (11) */
    *rn = RN * fconf;
    zeta_n = RR * fconf;
    Un->x = -zeta_n * cos(angle);
    Un->y = zeta_n * sin(angle);

    /* Notes of 920314, (12) */
    *rp = RP * fconf;
    Up->x = -*rp / A0;
    Up->y = 1 - *rp;

    /*************
     *  PHASE 4  *
     *************/
    /* Notes of 920314, (13) */
    l = sqrt((Un->x + 1.0) * (Un->x + 1.0) + Un->y * Un->y);
    D = sqrt(l*l - *rn * *rn);
    angle_3 = atan(Un->y/(Un->x + 1.0)) - asin(*rn / l);

    /* Notes of 920314, (14) */
    *an = tan(angle_3);

    /* Notes of 920314, (15) */
    *xTnn = D * cos(angle_3) - 1.0;
    Tnn.y = D * sin(angle_3);

    /*************
     *  PHASE 5  *
     *************/
    /* Notes of 920314, preceding (16) */
    D = sqrt((Up->x - Un->x) * (Up->x - Un->x) + (Up->y - Un->y) * (Up->y - Un->y));
    l = sqrt(D*D - (*rp + *rn)*(*rp + *rn));
    angle_3 = acos(l/D);
    angle_2 = atan((Up->y - Un->y)/(Up->x - Un->x));

    /* Notes of 920314, (16) */
    angle_p = angle_3 + angle_2;
    if(angle_p == HALF_PI)
        *inv_ap = 0.0;
    else
        *inv_ap = 1.0 / tan(angle_p);

    /*************
     *  PHASE 6  *
     *************/
    /* Notes of 920314, (18) */
    angle_1 = angle_p + HALF_PI;
    *xTp = Up->x + *rp * cos(angle_1);
    *yTp = Up->y + *rp * sin(angle_1);

    /* Notes of 920314, (19) */
    *xTn = Un->x - *rn * cos(angle_1);
    Tn.y = Un->y - *rn * sin(angle_1);

    /* Notes of 920314, (19) */
    if(angle_p == HALF_PI)
        *bp = -INFINITY;
    else
        *bp = *yTp - *xTp / *inv_ap;
    return;
}
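Since the listing supplies analytic first and second derivatives, one inexpensive sanity check is to compare each against a central difference of the routine below it. The sketch here is an illustration only: `fd_discrepancy`, `toy_f`, and `toy_df` are hypothetical names that do not appear in the listing, and the toy pair exists solely so the checker can be exercised on its own.

```c
#include <math.h>

/* Toy function and its analytic derivative with respect to the first
 * argument (assumptions, used only to exercise the checker). */
double toy_f(double delta, double conf)  { return conf * delta * delta; }
double toy_df(double delta, double conf) { return 2.0 * conf * delta; }

/* Returns the absolute difference between the analytic derivative df and a
 * central-difference estimate of the derivative of f at delta.
 * h is the finite-difference half-width. */
double fd_discrepancy(double (*f)(double, double),
                      double (*df)(double, double),
                      double delta, double conf, double h)
{
    double numeric = (f(delta + h, conf) - f(delta - h, conf)) / (2.0 * h);
    return fabs(numeric - df(delta, conf));
}
```

When linked with the listing, calls such as `fd_discrepancy(cfm, d_cfm, 0.0, 0.5, 1e-6)` and `fd_discrepancy(d_cfm, dd_cfm, 0.0, 0.5, 1e-6)` should return small values. The synthetic function is assembled from line segments and circular arcs, so its second derivative jumps at the breakpoints xTnn, xTn, xTp, and Up.x; a check of `dd_cfm()` placed on a breakpoint is therefore not expected to agree.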


Appendix  E 


Differential  Learning  via  CFM  Viewed 
as  a  Generalization  of  Learning  via 
Rosenblatt’s  Perceptron  Criterion 
Function 


In this appendix we explore the similarities between differential learning via the CFM objective function
and learning via Rosenblatt's perceptron criterion function [116]. We prove that differential learning and
perceptron learning are quite similar for the 2-class pattern recognition task in which the classifier has one
linear discriminant function. We begin the proof with a differential learning formulation of the task; we then
alter the form of the CFM objective function and complete the proof.

Since the pattern recognition task is a C = 2-class task, we need a discriminator with only one
discriminant function $g_1(X|\theta)$: if $g_1(X|\theta)$ is positive, the classifier labels X as an example of class $\omega_1$;
if $g_1(X|\theta)$ is negative, the classifier labels X as an example of class $\omega_2$. However, we assume a classifier
of the form described in section 2.2.1, which obliges us to create a second phantom discriminant function
$g_2(X|\theta)$. This phantom discriminant function is related to $g_1(X|\theta)$ by

$$g_2(X|\theta) = -g_1(X|\theta) \tag{E.1}$$


so that the resulting classifier's operation is described by (2.6) and (2.7). The two discriminant functions are
constrained to be linear functions of X. If we adopt the convention of [29, ch. 5] and define the augmented
feature vector X' as an (N + 1)-element vector formed by preceding the original N-element vector with a
single element of constant unit value

$$X' = [1, x_1, \ldots, x_N]^T \tag{E.2}$$

Figure E.1: A classifier comprising a single linear discriminant function is equivalent to Rosenblatt's
perceptron when generated with this modified form of the CFM objective function.


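and we give the parameter vector $\theta$ (N + 1) elements, then the two linear discriminant functions are
described by

$$g_1(X|\theta) = {X'}^{T}\theta, \qquad g_2(X|\theta) = -{X'}^{T}\theta \tag{E.3}$$

(the notation $Z^T$ denotes the transpose of the vector Z). The discriminant differentials associated with
$g_1(X|\theta)$ and $g_2(X|\theta)$ are therefore

$$\begin{aligned}
\delta_1(X|\theta) &= g_1(X|\theta) - g_2(X|\theta) = 2{X'}^{T}\theta \\
\delta_2(X|\theta) &= g_2(X|\theta) - g_1(X|\theta) = -2{X'}^{T}\theta
\end{aligned} \tag{E.4}$$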


Recall from section 2.2.4 that the argument of the CFM objective function associated with a training
example is $\delta_\tau$ when the class label of the training example is $\omega_\tau$. Let us change the functional form of
the CFM objective function from the sigmoidal function $\sigma[\delta, \psi]$ described in sections 2.2.4 and 2.4 and
appendix D to the piece-wise linear function

$$\sigma[\delta] = \begin{cases} 0, & \delta \ge 0 \\ \delta/2, & \delta < 0 \end{cases} \tag{E.5}$$

illustrated in Figure E.1. Under these circumstances, the average CFM for the training sample $S^n$ of size n
is, by (2.81) and (2.82),

$$\mathrm{CFM}(S^n|\theta) = \frac{1}{n}\sum_{j=1}^{n}
\begin{cases}
\sigma[\delta_1(X^j|\theta)], & W^j = \omega_1 \\
\sigma[\delta_2(X^j|\theta)], & W^j = \omega_2
\end{cases}
= \frac{1}{n}\sum_{j=1}^{n}
\begin{cases}
{X^j}'^{T}\theta, & W^j = \omega_1 \;\&\; {X^j}'^{T}\theta < 0 \\
-{X^j}'^{T}\theta, & W^j = \omega_2 \;\&\; {X^j}'^{T}\theta > 0 \\
0, & \text{otherwise}
\end{cases} \tag{E.6}$$

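Maximizing $\mathrm{CFM}(S^n|\theta)$ over $\theta$ is equivalent to minimizing $-\mathrm{CFM}(S^n|\theta)$ over $\theta$:

$$\max_\theta\, \mathrm{CFM}(S^n|\theta) \;\equiv\; \min_\theta\, \frac{1}{n}
\underbrace{\sum_{j=1}^{n}
\begin{cases}
-{X^j}'^{T}\theta, & W^j = \omega_1 \;\&\; {X^j}'^{T}\theta < 0 \\
{X^j}'^{T}\theta, & W^j = \omega_2 \;\&\; {X^j}'^{T}\theta > 0 \\
0, & \text{otherwise}
\end{cases}}_{\text{Rosenblatt's Perceptron Criterion Function}} \tag{E.7}$$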
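The equivalence can be checked numerically on a small sample. The sketch below is an illustration under stated assumptions rather than code from the report: the function names are hypothetical, class labels are coded as the integers 1 and 2, and the piece-wise linear CFM is taken to be 0 for non-negative differentials and half the differential otherwise, which is the form that (E.4) and (E.6) together imply.

```c
#include <stddef.h>

/* Inner product of the augmented feature vector X' = [1, x_1, ..., x_N]
 * with the (N + 1)-element parameter vector theta. */
static double augmented_dot(const double *x, const double *theta, size_t N)
{
    size_t i;
    double s = theta[0];                   /* constant unit element of X' */
    for (i = 0; i < N; i++)
        s += x[i] * theta[i + 1];
    return s;
}

/* Piece-wise linear CFM: zero for delta >= 0, delta/2 otherwise. */
static double sigma_linear(double delta)
{
    return (delta >= 0.0) ? 0.0 : 0.5 * delta;
}

/* Average CFM of the sample. X holds n examples of N features each, row by
 * row; label[j] is 1 or 2. */
double cfm_linear(const double *X, const int *label, size_t n, size_t N,
                  const double *theta)
{
    size_t j;
    double sum = 0.0;
    for (j = 0; j < n; j++) {
        double g1 = augmented_dot(X + j * N, theta, N);
        double delta = (label[j] == 1) ? 2.0 * g1 : -2.0 * g1;
        sum += sigma_linear(delta);
    }
    return sum / (double)n;
}

/* Perceptron criterion scaled by 1/n: only misclassified examples add a
 * (positive) penalty. */
double perceptron_criterion(const double *X, const int *label, size_t n,
                            size_t N, const double *theta)
{
    size_t j;
    double sum = 0.0;
    for (j = 0; j < n; j++) {
        double g1 = augmented_dot(X + j * N, theta, N);
        if (label[j] == 1 && g1 < 0.0)
            sum += -g1;
        else if (label[j] == 2 && g1 > 0.0)
            sum += g1;
    }
    return sum / (double)n;
}
```

For any sample and parameter vector the two functions return values that sum to zero, so a parameter vector that maximizes one minimizes the other.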


Equation (E.7) is, but for a constant, identical to Rosenblatt's perceptron criterion function (cf. (12) of
[29, ch. 5]). Thus, differential learning via CFM can be viewed as a generalization of the perceptron approach
to discriminative learning. The generalization is four-part:

1.  Learning is extended from the 2-class pattern recognition task to the general C > 2-class task.

2.  The functional form of the classifier's discriminant functions need not be linear. The removal of this
restriction allows differential learning to be applied to any differentiable supervised classifier (see
table 2.1 for some examples).




3.  The functional form of the CFM objective function, described in sections 2.2.4 and 2.4 and appendix D,
guarantees that the classifier will be asymptotically efficient (the focus of the entire text). By the proof
of section 2.4, the functional form of Rosenblatt's perceptron criterion function (represented as a CFM
objective function in figure E.1) lacks the sigmoidal shape necessary for engendering minimum-error
discrimination. When the number of classes is greater than two and/or the class-conditional densities
of X overlap (i.e., the concepts to be learned are stochastic), differential learning via the perceptron
criterion function of figure E.1 is provably not asymptotically efficient; differential learning via CFM
is.

4.  If the class-conditional densities of X (C = 2) are linearly separable, the linear discriminant
generated with the perceptron criterion function is guaranteed to separate the two classes of X. The
differential learning guarantee associated with the (C ≥ 2)-class X having potentially overlapping
class-conditional densities is analogous, albeit considerably stronger: the differentially-generated
classifier is guaranteed to yield the lowest error rate possible,¹ given an asymptotically large training
sample.


¹ The lower bound on the classifier's error rate is determined by how well its discriminator can approximate the Bayesian discriminant
function. Thus, when we state, "the lowest error rate possible," we mean the lowest possible given our particular choice of discriminator,
not the lowest possible given any choice of discriminator.


Appendix  F 

Proper  Parametric  Models  of  the 
Homoscedastic  Gaussian  Feature 
Vector¹


In this appendix we replicate two proofs: the normal-based linear discriminant analysis paradigm is the
fully-parametric proper model for the feature vector X with homoscedastic Gaussian class-conditional
pdfs; the logistic regression paradigm (a.k.a. logistic discriminant analysis) is the partially-parametric
proper model for the feature vector X with equal class prior probabilities and homoscedastic Gaussian
class-conditional pdfs. The first proof can be found in any introductory textbook on probability and statistics,
since the fully-parametric model learns by computing the maximum-likelihood estimates for the means and
covariance matrices of the feature vector's class-conditional pdfs. The second proof has been worked by
Akaike and White [2, 140, 142, 141] and by Hjort.²

The proofs require that the class-conditional covariance matrices are all of the form $\sigma^2 \cdot I$, where I
denotes the identity matrix. Under a simple linear transformation, a feature vector with homoscedastic
Gaussian class-conditional pdfs that are not of this form will be transformed to one with class-conditional
pdfs that are of this form. We assume such a transformation has been performed without loss of generality.
Given this assumption, the homoscedastic restriction on the class-conditional pdfs of X ensures that
class boundaries on X are piece-wise linear because all the class-conditional covariance matrices have
orthonormal eigenvectors and eigenvalues that all equal $\sigma^2$.


¹ A homoscedastic feature vector's class-conditional probability density functions (pdfs) all have the same covariance matrix.
² Hjort's proof is contained in [65], which we have not obtained; the reference is mentioned in [68, pg. 169].

F.1 The Fully-Parametric Proper Model

Consider the N-dimensional feature vector X with Gaussian class-conditional pdfs:


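$$\rho_{X|W}(X|\omega_i,\mu_i,\Sigma_i) = \frac{1}{(2\pi)^{N/2}\,|\Sigma_i|^{1/2}}\,\exp\left[-\frac{1}{2}(X-\mu_i)^T\,\Sigma_i^{-1}\,(X-\mu_i)\right] \tag{F.1}$$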


As described above, we assume that X is homoscedastic, and, without further loss of generality, that
it has undergone a linear transformation such that all its class-conditional covariance matrices are given by
$\sigma^2 \cdot I$. Under these conditions, (F.1) reduces to

$$\rho_{X|W}(X|\omega_i,\mu_i,\sigma^2) = (2\pi)^{-N/2}\,\sigma^{-N}\,\exp\left[-\frac{1}{2\sigma^2}(X-\mu_i)^T(X-\mu_i)\right] \tag{F.2}$$

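By Bayes' rule, the a posteriori class probabilities of X are given by

$$P_{W|X}(\omega_i|X,\mu_i,\sigma^2) = \frac{\rho_{X|W}(X|\omega_i,\mu_i,\sigma^2)\cdot P_W(\omega_i)}{\rho_X(X)} \tag{F.3}$$

$$\phantom{P_{W|X}(\omega_i|X,\mu_i,\sigma^2)} = \frac{\rho_{X|W}(X|\omega_i,\mu_i,\sigma^2)\cdot P_W(\omega_i)}{\sum_{k=1}^{C}\rho_{X|W}(X|\omega_k,\mu_k,\sigma^2)\cdot P_W(\omega_k)} \tag{F.4}$$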


In the form of (2.21), let

$$\mathcal{T}_i\big((X^j, W^j)\big) = \begin{cases} 1, & W^j = \omega_i \\ 0, & \text{otherwise} \end{cases} \tag{F.5}$$


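If we view the right-hand side of (F.4) as the basis for the likelihood equation, given the training sample
of n independently-drawn training example/class label pairs $S^n = \{(X^1, W^1), \ldots, (X^n, W^n)\}$, the
log-likelihood equation is

$$
\begin{aligned}
L(\hat{\mu}_1, \ldots, \hat{\mu}_C, \hat{\sigma}^2)
&= \ln \prod_{j=1}^{n} \prod_{i=1}^{C} \left[ P_W(\omega_i) \cdot \rho_{X|W}(X^j \mid \omega_i, \hat{\mu}_i, \hat{\sigma}^2) \right]^{\mathcal{T}_i((X^j, W^j))} \\
&= \sum_{j=1}^{n} \sum_{i=1}^{C} \left[ \mathcal{T}_i\big((X^j, W^j)\big) \cdot \left( \ln\big(P_W(\omega_i)\big) + \ln\big(\rho_{X|W}(X^j \mid \omega_i, \hat{\mu}_i, \hat{\sigma}^2)\big) \right) \right] \\
&= \sum_{j=1}^{n} \sum_{i=1}^{C} \mathcal{T}_i\big((X^j, W^j)\big) \left[ \ln\big(P_W(\omega_i)\big) - \ln\big((2\pi)^{N/2}\big) - N \cdot \ln(\hat{\sigma}) - \frac{1}{2\hat{\sigma}^2} (X^j - \hat{\mu}_i)^T (X^j - \hat{\mu}_i) \right]
\end{aligned}
\tag{F.6}
$$

where $X^j$ denotes the jth example of X in the training sample, $W^j$ denotes the class label of $X^j$, $\hat{\mu}_i$
is the maximum-likelihood estimate of the class-conditional mean $\mu_i$, and $\hat{\sigma}^2$ is the maximum-likelihood
estimate of the variance parameter $\sigma^2$.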

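Setting the gradient of (F.6) with respect to $\{\hat{\mu}_1, \ldots, \hat{\mu}_C, \hat{\sigma}^2\}$ equal to zero and solving the resulting
normal equations gives us the maximum-likelihood parameter estimates

$$
\begin{aligned}
\hat{\mu}_i &= \frac{1}{n_i} \sum_{j=1}^{n} \mathcal{T}_i\big((X^j, W^j)\big) \cdot X^j \\
\hat{\sigma}^2 &= \frac{1}{nN} \sum_{i=1}^{C} \sum_{j=1}^{n} \mathcal{T}_i\big((X^j, W^j)\big) \cdot (X^j - \hat{\mu}_i)^T (X^j - \hat{\mu}_i)
\end{aligned}
\tag{F.7}
$$

where $n_i$ denotes the number of examples in $S^n$ with the class label $\omega_i$. These estimates are used in (F.4),
along with sample-based estimates of the class prior probabilities $\{P_W(\omega_1), \ldots, P_W(\omega_C)\}$, to produce the
fully-parametric proper model of X. The process is commonly called normal-based linear discriminant
analysis (e.g., [91]).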
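The estimation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_isotropic_gaussian`, the synthetic two-class data, and the use of integer class labels are all hypothetical and do not come from the report. The variance estimate pools squared distances to the class means over all n examples and all N coordinates, which is what setting the derivative of the log-likelihood with respect to the variance parameter to zero yields when every class shares the covariance $\sigma^2 I$.

```python
import random


def fit_isotropic_gaussian(samples, labels, num_classes):
    """Maximum-likelihood fit of a homoscedastic Gaussian model whose
    class-conditional covariance matrices all equal sigma2 * I.

    samples: list of N-dimensional points (lists of floats)
    labels:  list of integer class labels in range(num_classes)
    Returns (means, sigma2, priors).
    """
    n = len(samples)
    dim = len(samples[0])
    counts = [0] * num_classes
    sums = [[0.0] * dim for _ in range(num_classes)]
    for x, w in zip(samples, labels):
        counts[w] += 1
        for d in range(dim):
            sums[w][d] += x[d]
    # Class means: average of the examples that carry each label.
    means = [[s / counts[i] for s in sums[i]] for i in range(num_classes)]
    # Pooled variance: squared distance of each example to its own class
    # mean, averaged over all n examples and all N coordinates.
    total_sq = 0.0
    for x, w in zip(samples, labels):
        total_sq += sum((x[d] - means[w][d]) ** 2 for d in range(dim))
    sigma2 = total_sq / (n * dim)
    # Sample-based estimates of the class prior probabilities.
    priors = [c / n for c in counts]
    return means, sigma2, priors


# Hypothetical synthetic data: two classes in two dimensions, sigma = 0.5.
rng = random.Random(0)
true_means = [[-1.0, 0.0], [2.0, 1.0]]
true_sigma = 0.5
samples, labels = [], []
for _ in range(4000):
    w = rng.randrange(2)
    samples.append([rng.gauss(m, true_sigma) for m in true_means[w]])
    labels.append(w)

means, sigma2, priors = fit_isotropic_gaussian(samples, labels, 2)
print(means, sigma2, priors)
```

With enough data the printed means and variance should sit close to the generating values; the estimates would then be plugged into the Bayes-rule posterior to obtain the classifier.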

F.2  The  Partially-Parametric  Proper  Model 

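The a posteriori class probabilities in (F.4) can be expressed as follows:

$$P_{W|X}(\omega_i \mid X, \mu_i, \sigma^2) = \left[ \sum_{k=1}^{C} \frac{P_W(\omega_k)}{P_W(\omega_i)} \exp\left[ \frac{1}{\sigma^2}(\mu_k - \mu_i)^T X + \frac{1}{2\sigma^2}\left(\mu_i^T \mu_i - \mu_k^T \mu_k\right) \right] \right]^{-1} \tag{F.8}$$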


F.2.1  C  =  2:  Logistic  Regression 


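When C = 2, X represents one of two classes. If both classes have prior probabilities of $\frac{1}{2}$, the a posteriori
class probabilities of X are

$$P_{W|X}(\omega_1 \mid X, \mu_1, \sigma^2) = \Bigg[ 1 + \exp\Bigg[ \underbrace{\frac{1}{\sigma^2}(\mu_2 - \mu_1)^T}_{\alpha^T} X + \underbrace{\frac{1}{2\sigma^2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right)}_{\beta} \Bigg] \Bigg]^{-1} \tag{F.9}$$

and

$$P_{W|X}(\omega_2 \mid X, \mu_2, \sigma^2) = 1 - P_{W|X}(\omega_1 \mid X, \mu_1, \sigma^2) = \Bigg[ 1 + \exp\Bigg[ \underbrace{-\frac{1}{\sigma^2}(\mu_2 - \mu_1)^T}_{-\alpha^T} X \underbrace{- \frac{1}{2\sigma^2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right)}_{-\beta} \Bigg] \Bigg]^{-1} \tag{F.10}$$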

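Note that the N-dimensional vector $\alpha$ and the scalar $\beta$ can be viewed as the ultimate parameters of this
partially-parametric model, so we can make the following equations:

$$
\begin{aligned}
P_{W|X}(\omega_1 \mid X, \mu_1, \sigma^2) &= P_{W|X}(\omega_1 \mid X, \alpha, \beta) = \left[ 1 + \exp\left[ \alpha^T X + \beta \right] \right]^{-1} \\
P_{W|X}(\omega_2 \mid X, \mu_2, \sigma^2) &= P_{W|X}(\omega_2 \mid X, \alpha, \beta) = \left[ 1 + \exp\left[ -\alpha^T X - \beta \right] \right]^{-1}
\end{aligned}
\tag{F.11}
$$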
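The equivalence between the Bayes-rule posterior and its logistic form can be checked numerically. The sketch below assumes two equiprobable classes with isotropic Gaussian class-conditional pdfs of common variance; the helper names (`posterior_bayes`, `logistic_params`, `posterior_logistic`) and the numerical values are hypothetical, chosen only for illustration.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_pdf(x, mean, sigma2):
    """Isotropic Gaussian density with covariance sigma2 * I."""
    n = len(x)
    diff = [a - b for a, b in zip(x, mean)]
    norm = (2.0 * math.pi * sigma2) ** (-n / 2.0)
    return norm * math.exp(-dot(diff, diff) / (2.0 * sigma2))


def posterior_bayes(x, mu1, mu2, sigma2):
    """P(class 1 | x) from Bayes' rule; the equal priors cancel."""
    p1 = gaussian_pdf(x, mu1, sigma2)
    p2 = gaussian_pdf(x, mu2, sigma2)
    return p1 / (p1 + p2)


def logistic_params(mu1, mu2, sigma2):
    """alpha = (mu2 - mu1) / sigma2, beta = (mu1.mu1 - mu2.mu2) / (2 sigma2)."""
    alpha = [(b - a) / sigma2 for a, b in zip(mu1, mu2)]
    beta = (dot(mu1, mu1) - dot(mu2, mu2)) / (2.0 * sigma2)
    return alpha, beta


def posterior_logistic(x, alpha, beta):
    """P(class 1 | x) in logistic form: 1 / (1 + exp(alpha.x + beta))."""
    return 1.0 / (1.0 + math.exp(dot(alpha, x) + beta))


# Hypothetical parameters for a two-dimensional feature vector.
mu1, mu2, sigma2 = [0.0, 0.0], [1.5, -0.5], 0.8
alpha, beta = logistic_params(mu1, mu2, sigma2)
points = [[0.3, 0.7], [-1.0, 2.0], [2.5, -1.5]]
for x in points:
    print(posterior_bayes(x, mu1, mu2, sigma2), posterior_logistic(x, alpha, beta))
```

Each printed pair should agree to floating-point precision, which is the sense in which $\alpha$ and $\beta$ are the only parameters the posterior actually depends on.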

When the method of maximum-likelihood is used to estimate the parameters $\alpha$ and $\beta$, it is modified
to maximize a product of independent a posteriori class probabilities rather than a product of independent
class-conditional pdf terms. Maximizing the logarithm of this product is equivalent; the procedure is called
maximizing the logit of risk [83, pp. 80-82]. The model of (F.11) is, by definition 3.13, a partially-
parametric proper model of X. Assuming n independently drawn training example/class label pairs
$S^n = \{(X^1, W^1), \ldots, (X^n, W^n)\}$, the logit risk function is (e.g., [68, pg. 9])

$$
\begin{aligned}
L(\hat{\alpha}, \hat{\beta})
&= \ln \prod_{j=1}^{n} \left( P_{W|X}(\omega_1 \mid X^j, \hat{\alpha}, \hat{\beta}) \right)^{\mathcal{T}_1((X^j, W^j))} \cdot \left( P_{W|X}(\omega_2 \mid X^j, \hat{\alpha}, \hat{\beta}) \right)^{\mathcal{T}_2((X^j, W^j))} \\
&= \sum_{j=1}^{n} \left[ \mathcal{T}_1\big((X^j, W^j)\big) \cdot \ln\big( P_{W|X}(\omega_1 \mid X^j, \hat{\alpha}, \hat{\beta}) \big) + \mathcal{T}_2\big((X^j, W^j)\big) \cdot \ln\big( P_{W|X}(\omega_2 \mid X^j, \hat{\alpha}, \hat{\beta}) \big) \right],
\end{aligned}
\tag{F.12}
$$

where $\mathcal{T}_i((X^j, W^j))$ is given in (F.5), and $\hat{\alpha}$ and $\hat{\beta}$ denote estimates of $\alpha$ and $\beta$. Equation (F.12) can
be differentiated with respect to its N + 1 parameters in order to generate normal equations. The resulting
equations are non-linear with respect to the parameters, so they must be solved iteratively. We omit the
normal equations because they are not essential to our argument; see [68] as an example of such details.

If we view the estimated a posteriori class probabilities in (F.12) as discriminant functions, and we use
the notation $\theta = [\hat{\alpha}^T, \hat{\beta}]^T$, then

$$g_i(X \mid \theta) = P_{W|X}(\omega_i \mid X, \hat{\alpha}, \hat{\beta}) \tag{F.13}$$

and (F.12) can be re-stated as

$$L(\theta) = \sum_{j=1}^{n} \left[ \mathcal{T}_1\big((X^j, W^j)\big) \cdot \ln\big(g_1(X^j \mid \theta)\big) + \mathcal{T}_2\big((X^j, W^j)\big) \cdot \ln\big(g_2(X^j \mid \theta)\big) \right] \tag{F.14}$$

The reader should recognize that this form of L is (but for a constant factor, cf. section 2.3.2)
the Kullback-Leibler information distance [82, 81] of the training sample, given the discriminant functions
$g_1(X \mid \theta)$ and $g_2(X \mid \theta)$. Thus, the maximum-likelihood parameters of the logistic regression model are
obtained by minimizing the Kullback-Leibler information distance between the training sample and the
discriminator $\mathcal{G}(X \mid \theta) = \{g_1(X \mid \theta), g_2(X \mid \theta)\}$, where $g_i(X \mid \theta)$ is given by (F.13). By the proof of
section 2.3.2, this learning strategy leads to the following parameterization for large training sample sizes:

$$
\lim_{n \to \infty}
\begin{cases}
\hat{\alpha} = \alpha = \dfrac{1}{\sigma^2}(\mu_2 - \mu_1) \\[2ex]
\hat{\beta} = \beta = \dfrac{1}{2\sigma^2}\left(\mu_1^T \mu_1 - \mu_2^T \mu_2\right)
\end{cases}
\tag{F.15}
$$
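A small simulation can illustrate the large-sample behaviour claimed above. The sketch below is an assumption-laden example, not the report's procedure: it draws a scalar feature from two equiprobable unit-variance Gaussian classes with hypothetical means -1 and +1, then fits the logistic model by Newton's method (one common way to solve the non-linear normal equations iteratively). For these values the limiting parameters are $\alpha = 2$ and $\beta = 0$.

```python
import math
import random


def fit_logistic_1d(xs, is_class2, iterations=15):
    """Maximum-likelihood fit of P(class 1 | x) = 1 / (1 + exp(a*x + b))
    by Newton's method; is_class2[j] is 1 when example j belongs to class 2."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, t2 in zip(xs, is_class2):
            p2 = 1.0 / (1.0 + math.exp(-(a * x + b)))  # P(class 2 | x)
            r = t2 - p2              # gradient of the log-likelihood wrt (a*x + b)
            w = p2 * (1.0 - p2)      # curvature weight
            g_a += r * x
            g_b += r
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        # Newton step: solve the 2x2 system H * delta = g.
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


# Hypothetical two-class scalar feature: means -1 and +1, sigma = 1.
rng = random.Random(1)
mu1, mu2, sigma = -1.0, 1.0, 1.0
xs, is_class2 = [], []
for _ in range(5000):
    t2 = rng.randrange(2)
    xs.append(rng.gauss(mu2 if t2 else mu1, sigma))
    is_class2.append(t2)

a_hat, b_hat = fit_logistic_1d(xs, is_class2)
alpha = (mu2 - mu1) / sigma ** 2
beta = (mu1 ** 2 - mu2 ** 2) / (2.0 * sigma ** 2)
print(a_hat, alpha, b_hat, beta)
```

The fitted values should land near the theoretical ones, with the gap shrinking as the sample size grows.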

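F.2.2 C > 2: Logistic Discriminant Analysis

If the class prior probabilities are all equal, then (F.8) simplifies to

$$P_{W|X}(\omega_i \mid X, \mu_i, \sigma^2) = \left[ 1 + \sum_{\substack{k=1 \\ k \neq i}}^{C} \exp\left[ \frac{1}{\sigma^2}(\mu_k - \mu_i)^T X + \frac{1}{2\sigma^2}\left(\mu_i^T \mu_i - \mu_k^T \mu_k\right) \right] \right]^{-1} \tag{F.16}$$

When C > 2, (F.12) assumes the more general form

$$
\begin{aligned}
L(\hat{\alpha}_1, \ldots, \hat{\alpha}_C, \hat{\beta}_1, \ldots, \hat{\beta}_C)
&= \ln \left( \prod_{j=1}^{n} \prod_{i=1}^{C} \left( P_{W|X}(\omega_i \mid X^j, \hat{\alpha}_i, \hat{\beta}_i) \right)^{\mathcal{T}_i((X^j, W^j))} \right) \\
&= \sum_{j=1}^{n} \sum_{i=1}^{C} \left[ \mathcal{T}_i\big((X^j, W^j)\big) \cdot \ln\big( P_{W|X}(\omega_i \mid X^j, \hat{\alpha}_i, \hat{\beta}_i) \big) \right]
\end{aligned}
\tag{F.17}
$$

A single logistic discriminant function of the form

$$g_i(X \mid \theta) = \left[ 1 + \exp\left[ \alpha_i^T X + \beta_i \right] \right]^{-1} \tag{F.18}$$

is used to model each of the C a posteriori probabilities of X. In order for $g_i(X \mid \theta)$ to be a reasonably
good approximation of the a posteriori probabilities given in (F.11), each class-conditional mean $\mu_i$ must
have only one neighboring mean (which we will denote by $\mu_{i'}$) closer than about $3\sigma$. Under this condition,
each class-conditional pdf of the feature vector has only one close neighbor. That is, each class is confusable
with only one other,³ and the ith a posteriori probability of (F.11) is reasonably well approximated by the
logistic function

$$P_{W|X}(\omega_i \mid X, \mu_i, \sigma^2) \approx \Bigg[ 1 + \exp\Bigg[ \underbrace{\frac{1}{\sigma^2}(\mu_{i'} - \mu_i)^T}_{\alpha_i^T} X + \underbrace{\frac{1}{2\sigma^2}\left(\mu_i^T \mu_i - \mu_{i'}^T \mu_{i'}\right)}_{\beta_i} \Bigg] \Bigg]^{-1} \tag{F.19}$$

By the same arguments as those of the preceding section, $\mathcal{G}(X \mid \theta) = \{g_1(X \mid \theta), \ldots, g_C(X \mid \theta)\}$ will
be a reasonable approximation to the proper parametric model of X. If the model learns by minimizing
its Kullback-Leibler information distance with the training sample, the resulting parameters $\{\hat{\alpha}_1, \ldots, \hat{\alpha}_C\}$
and $\{\hat{\beta}_1, \ldots, \hat{\beta}_C\}$ will be maximum-likelihood estimates of their true values:

$$
\lim_{n \to \infty}
\begin{cases}
\hat{\alpha}_i = \alpha_i = \dfrac{1}{\sigma^2}(\mu_{i'} - \mu_i) \\[2ex]
\hat{\beta}_i = \beta_i = \dfrac{1}{2\sigma^2}\left(\mu_i^T \mu_i - \mu_{i'}^T \mu_{i'}\right)
\end{cases}
\tag{F.20}
$$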

F.3 The Asymptotic Relative Efficiency of Logistic Discriminant Analysis
Versus Normal-Based Linear Discriminant Analysis

Efron studies the asymptotic relative efficiency (ARE) (see section 3.6.1) of the fully- and partially-parametric
proper models for the homoscedastic Gaussian feature vector in [30]. We remind the reader that normal-based
linear discriminant analysis is the fully-parametric model and logistic discriminant analysis is the partially-
parametric model. Efron's definition of ARE is based on the ratio of the fully-parametric model's error rate
to that of the partially-parametric model. Our definition of ARE (definition 3.18) is based on the ratio of
one classifier's MSDE to another classifier's MSDE: we generally assume the two classifiers differ only in
terms of the learning strategy they employ, although a simple notational generalization of definition 3.18
allows for a comparison of classifiers with different hypothesis classes. Our definition therefore has a similar
philosophical motivation to Efron's, but it focuses on a comparison of MSDE (squared discriminant bias
plus discriminant variance), whereas Efron's definition focuses only on a comparison of discriminant bias.

³ We subjectively characterize two class-conditional pdfs as "close" neighbors if their means are separated by less than three standard
deviations. If we were to set a more rigorous standard for closeness (say, five standard deviations) the approximation of (F.19)
would be so much the better.

These differences notwithstanding, Efron's work proves that, for the C = 2-class case, (fully-parametric)
normal-based linear discriminant analysis is more efficient than (partially-parametric) logistic discriminant
analysis. A similar analysis of the C > 2-class case can be found in [19]. The reason, stated in intuitive
terms,  is  that  the  fully-parametric  paradigm  is  a  more  constrained  model  of  the  data,  despite  its  having  more 
parameters.  The  class-conditional  means  are  explicitly  modeled  in  the  fully-parametric  paradigm,  whereas 
only  the  difference  between  these  means  is  modeled  in  the  partially-parametric  paradigm.  The  higher  degree 
of  specificity  in  the  fully-parametric  proper  model  makes  it  more  efficient  by  Efron’s  definition;  again,  by 
our  definition  of  discriminant  efficiency,  Efron’s  work  proves  that  the  fully-parametric  model  exhibits  lower 
discriminant  bias  than  its  partially-parametric  counterpart:  no  statement  is  made  concerning  discriminant 
variance. 

The  greater  specificity  of  the  fully-parametric  model  is  an  advantage  when  it  is  indeed  proper  (i.e., 
when  the  assumptions  regarding  the  probabilistic  nature  of  X  are  valid),  but  it  is  a  disadvantage  when 
the  model  is  improper.  Specifically,  the  fully-parametric  model  described  in  this  appendix  is  proper  only 
for homoscedastic Gaussian-distributed feature vectors, whereas the partially-parametric model described
herein is proper for a much broader family of homoscedastic exponentially-distributed feature vectors (e.g.,
[83]).  Thus,  if  the  feature  vector  is  exponentially-distributed  rather  than  Gaussian-distributed,  the  partially- 
parametric  model  will  be  proper  (and  efficient),  whereas  the  fully-parametric  model  will  be  neither  proper 
nor  efficient.  This  phenomenon  leads  us  full-circle  to  the  arguments  of  chapter  3;  if  we  assume  that  the 
feature  vector  is  arbitrarily-distributed,  then  neither  the  fully-  nor  the  partially-parametric  model  is  proper, 
so  the  most  efficient  classifier  possible,  given  the  improper  logistic  linear  hypothesis  class  implied  by  both 
parametric  models  described  herein,  will  be  generated  by  the  differential  learning  strategy. 

F.4  The  Proper  Parametric  Model  Constraints  are  Severe 

Given  the  preceding  insights,  the  choice  of  learning  strategy  hinges  on  whether  the  model  of  the  data  is 
proper.  In  order  for  the  (fully-parametric)  normal-based  linear  discriminant  analysis  paradigm  to  be  a  proper 
parametric  model  of  the  C  >2-class  Gaussian  feature  vector  X ,  the  class-conditional  pdfs  of  X  must  be 
homoscedastic. 

In order for the (partially-parametric) logistic discriminant analysis paradigm to be a proper parametric
model of the 2-class exponentially-distributed feature vector X, the two class-conditional pdfs of X must be
homoscedastic and the class prior probabilities must be equal. In order for the logistic discriminant analysis
paradigm to be a proper parametric model of the C > 2-class exponentially-distributed feature vector X, all
the class-conditional pdfs of X must be homoscedastic; furthermore, no class-conditional pdf can be a close
neighbor in $\mathcal{X}$ of more than one other pdf, and the class prior probabilities must be equal.

These  constraints  on  the  form  of  X  are  strong,  and  in  reality  they  rarely  hold.  When  they  do  hold,  it  is 
usually  in  the  context  of  a  deterministic  feature  vector  that  is  corrupted  by  independent  additive  Gaussian 
noise  with  these  nice  properties.  The  resulting  “random”  feature  vector  can  be  modeled  quite  well  by 
either  of  the  proper  models  described  herein.  Classical  hypothesis  testing  procedures  (e.g.,  see  [140])  can  be 
employed to verify whether or not the models are indeed proper. Unless the proper hypothesis is confirmed,
both  of  the  parametric  models  described  herein  will,  by  the  proofs  of  chapter  3,  be  both  improper  models  and 
inefficient  classifiers  of  X . 


Appendix  G 

Error  Rate  Computations  for  the 
Classifiers  of  Chapter  4 


Recall from definition 3.1 (page 55) that the true error rate $P_e(\mathcal{G} \mid \theta)$ for the classifier of x is given by

$$P_e(\mathcal{G} \mid \theta) = E_x\big[P_e(\mathcal{G}(x \mid \theta))\big] = \int_{\mathcal{X}} P_e(\mathcal{G}(x \mid \theta)) \, \rho_x(x) \, dx, \tag{G.1}$$

where

$$
\begin{aligned}
P_e(\mathcal{G}(x \mid \theta)) &\triangleq 1 - P_{W|x}\big(\mathcal{D}(x \mid \theta) \mid x\big) \\
&= 1 - P_{W|x}\big(\Gamma(\mathcal{G}(x \mid \theta)) \mid x\big),
\end{aligned}
\tag{G.2}
$$

and $\mathcal{D}(x \mid \theta) = \Gamma(\mathcal{G}(x \mid \theta))$ denotes the class label that the classifier assigns to its input x. The error
rates that we quote in chapter 4, for both the proper and improper parametric models, are computed
according to (G.1) because we play the role of an oracle and we know the probabilistic nature of the feature
x. The following two sections outline the procedures we use to do the computations.


G.1 Error Rate Computations for the Proper Parametric Model

The fully-parametric and partially-parametric models of section 4.2 form one and only one class boundary
on the domain of the homoscedastic Gaussian feature x. The boundary for the fully-parametric model is, by
(4.8),

$$B_{1,2\ \text{Fully-Parametric}} = \frac{\hat{\mu}_1 + \hat{\mu}_2}{2} \tag{G.3}$$

where $\hat{\mu}_1$ and $\hat{\mu}_2$ are two of the three model parameters. The boundary for the partially-parametric model
is, by (4.12),

$$B_{1,2\ \text{Partially-Parametric}} = -\frac{\theta_{1,0}}{\theta_{1,1}}, \tag{G.4}$$

where $\theta_{1,0}$ and $\theta_{1,1}$ are the two model parameters.

The error rate of both models can be expressed in terms of the class boundary $B_{1,2}$. All examples of
class $\omega_1$ having values of x greater than the boundary are misclassified, as are all examples of class $\omega_2$
having values of x less than the boundary; mathematically,

$$
P_e(\mathcal{G} \mid \theta) = P_W(\omega_1) \int_{B_{1,2}}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu_1)^2}{2\sigma^2} \right] dx
\; + \; P_W(\omega_2) \int_{-\infty}^{B_{1,2}} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{(x - \mu_2)^2}{2\sigma^2} \right] dx
\tag{G.5}
$$

The integrals in (G.5) are easily computed via the Chebyshev approximation to the error function (erf) [106,
sec. 6.2].
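As an illustration of this computation, the following sketch evaluates the two Gaussian tail integrals for a scalar feature with a single class boundary. It is a minimal example under stated assumptions: it uses the standard-library `math.erf` in place of the Chebyshev approximation cited in the text, it assumes class 1 has the smaller mean, and the function names and numerical values are hypothetical.

```python
import math


def normal_cdf(x, mean, sigma):
    """Gaussian cumulative distribution function expressed through erf."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def two_class_error_rate(boundary, mu1, mu2, sigma, prior1=0.5):
    """Error rate of a single-boundary classifier on a scalar feature.

    Class 1 (mean mu1 < mu2) is misclassified above the boundary;
    class 2 is misclassified below it.
    """
    prior2 = 1.0 - prior1
    miss1 = 1.0 - normal_cdf(boundary, mu1, sigma)  # class-1 mass above the boundary
    miss2 = normal_cdf(boundary, mu2, sigma)        # class-2 mass below the boundary
    return prior1 * miss1 + prior2 * miss2


# Hypothetical example: means -1 and +1, unit variance, equal priors.
mu1, mu2, sigma = -1.0, 1.0, 1.0
midpoint = 0.5 * (mu1 + mu2)       # boundary halfway between the two means
best = two_class_error_rate(midpoint, mu1, mu2, sigma)
shifted = two_class_error_rate(0.4, mu1, mu2, sigma)
print(best, shifted)
```

Moving the boundary away from the midpoint, as an imperfectly estimated model would, can only raise the error rate in this equal-prior setting.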


G.2  Error  Rate  Computations  for  the  Improper  Parametric  Model 

The error rates of the polynomial classifiers in section 4.3 are evaluated in a crude but computationally
simple fashion. Since the high-complexity classifier can, in principle, form many class boundaries on the
domain of x, we compute the integral of (G.1) numerically, using a successive approximation technique.
This saves the trouble of computing the class boundaries (essentially a polynomial root-finding task)
and then evaluating the integral as in (G.5). Given the specific probabilistic nature of x in section 4.3, the
classifier's error rate is expressed as

P,(^|«)  =  /  ~>Ti{'D  {x\9))  p,,y^^.{x\U\)  Pw{UJ\)  dx 

■/-s.s  - - - - ' 

0.05 

+  /  ^T2  {V  (x(0))  (jr|U^2)  Ptv(^^2)  dx 

./-4  ' - -  ■  ■  ■—  '
-■ri  (P(jr|e))  Pn;(t<^.0  dx,

0.05 


(G.6) 


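where

$$\neg\mathcal{T}_i\big(\mathcal{D}(x \mid \theta)\big) = \begin{cases} 0, & \mathcal{D}(x \mid \theta) = \omega_i \\ 1, & \text{otherwise} \end{cases} \tag{G.7}$$

Again, $\mathcal{D}(x \mid \theta)$ denotes the class label that the classifier assigns to its input x. We use the following
numerical quadrature approximation for each of the three integrals in (G.6):

$$
\int_a^b \neg\mathcal{T}_i\big(\mathcal{D}(x \mid \theta)\big) \, \rho_{x|W}(x \mid \omega_i) \, P_W(\omega_i) \, dx \;\approx\;
\frac{b - a}{M} \sum_{j=0}^{M-1} \neg\mathcal{T}_i\Big(\mathcal{D}\big(a + (j+1) \cdot \tfrac{b-a}{M} \,\big|\, \theta\big)\Big) \, \rho_{x|W}\big(a + (j+1) \cdot \tfrac{b-a}{M} \,\big|\, \omega_i\big) \, P_W(\omega_i)
\tag{G.8}
$$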

We begin with M = 30 intervals, and double M until subsequent approximations differ by less than 10“'*.
This numerical integration technique is equivalent to the trapezoid rule (e.g., [20, sec. 4.3]) as M grows
large. It is admittedly crude, but it is trivial to implement and provides sufficient precision for our purposes.
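The successive-approximation scheme can be sketched as follows. Several details here are assumptions rather than facts from the report: the sum samples the integrand at the right end of each of the M sub-intervals, the stopping tolerance `tol` is a placeholder value, and the integrand (a constant weighted density of 0.5 on [0, 2] with a classifier that errs only above x = 0.37) is a made-up example.

```python
def riemann_sum(f, a, b, m):
    """Equal-width sum over [a, b] with m intervals, sampling f at
    a + (j + 1) * (b - a) / m for j = 0 .. m - 1."""
    h = (b - a) / m
    return h * sum(f(a + (j + 1) * h) for j in range(m))


def integrate_by_doubling(f, a, b, m=30, tol=1e-4, max_doublings=20):
    """Double the number of intervals until two successive approximations
    differ by less than tol. Returns (estimate, intervals_used)."""
    previous = riemann_sum(f, a, b, m)
    for _ in range(max_doublings):
        m *= 2
        current = riemann_sum(f, a, b, m)
        if abs(current - previous) < tol:
            return current, m
        previous = current
    raise RuntimeError("approximation did not settle within max_doublings")


def weighted_error_integrand(x):
    """Hypothetical integrand: error indicator times (pdf * prior) = 0.5."""
    misclassified = 1.0 if x > 0.37 else 0.0
    return misclassified * 0.5


estimate, intervals = integrate_by_doubling(weighted_error_integrand, 0.0, 2.0)
exact = 0.5 * (2.0 - 0.37)
print(estimate, exact, intervals)
```

Because the integrand jumps wherever the classifier's decision changes, two successive sums can agree before the estimate has fully converged, so the result is only as accurate as the final interval width allows. That is consistent with the text's description of the method as crude but adequate.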


Appendix  H 


Asymptotic  Parameterizations  for  the 
Probabilistically-Generated  Improper 
Parametric  Models  of  Chapter  4 


We  generate  improper  parametric  models  in  section  4.3  by  minimizing  the  mean-squared  error  (MSE) 
between  the  discriminator  output  vector  Y  and  a  corresponding  target  vector  denoting  the  class  of  the 
training  example  (see  section  2.3);  the  minimization  is  done  for  all  examples  in  the  training  sample,  and 
generally  takes  the  form  of  an  iterative  search  procedure.  We  employ  backpropagation,  a  well-known 
probabilistic  learning  paradigm;  its  iterative  search  procedure  is  gradient  descent,  and  the  gradient  of  the 
classifier's MSE with respect to the parameter vector $\theta$ is computed by the chain-rule [119, 120].¹

We denote the training sample of size n by $S^n$, and we denote a particular unique value (or pattern) of x
by $x_p$. If there are P unique patterns in $S^n$, and for each of these patterns there are $n_{p,i}$ examples belonging
to class $\omega_i$, the sample MSE of the classifier $\mathcal{G}(x \mid \theta) = \{g_1(x \mid \theta), \ldots, g_C(x \mid \theta)\}$ with parameterization
$\theta$ is given by


c  p 


MSE{S''\0)  = 


(g,(x,|0)  -  1)^ . 


H  "r..  =  « 

P=i 


(H.I) 


Recall that C denotes the total number of classes that x can represent. In the case of the random variable
x in section 4.3, C = 3. It is straightforward to prove that the classifier's MSE can be expressed by the
following expectation as the training sample size grows asymptotically large (see section 2.3.2):

$$
\lim_{n \to \infty} \mathrm{MSE}(S^n \mid \theta) = E_x\big[\mathrm{MSE}(x \mid \mathcal{G}(x \mid \theta))\big] =
\int_{-\infty}^{\infty} \sum_{i=1}^{C} \left[ (g_i(x \mid \theta) - 1)^2 \cdot P_{W|x}(\omega_i \mid x) + (g_i(x \mid \theta))^2 \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx
\tag{H.2}
$$

where the notation $E_x[\,\cdot\,]$ denotes the expectation over the domain of x, and

$$P_{W|x}(\neg\omega_i \mid x) = 1 - P_{W|x}(\omega_i \mid x) \tag{H.3}$$

¹ Backpropagation generally employs MSE, although other objective functions can be used. We employ only the MSE objective
function for probabilistic learning. The CE objective function, for example, cannot be used because the polynomial classifier's outputs
are unbounded; this violates the conditions necessary for using CE (see section 2.3.2).

The parameterization $\theta^*$ that minimizes the classifier's MSE can be found by substituting the
expressions for the classifier's discriminant functions into (H.2), deriving the expression for the gradient
$\nabla_\theta \big( E_x[\mathrm{MSE}(x \mid \mathcal{G}(x \mid \theta^*))] \big)$, setting this gradient equal to the zero vector, and solving the resulting
normal equations for $\theta^*$. Barnard and Casasent use this technique for deriving the minimum-MSE
parameterization of a linear classifier, given a 2-class Gaussian feature [6]. We derive distribution-independent
expressions for the asymptotic minimum-MSE parameterization of the ith discriminant function $g_i(x \mid \theta)$
in (H.2); expressions are given for constant, linear, and quadratic discriminant functions. Distribution-
independent expressions for the minimum-MSE parameterizations of higher-order polynomial discriminant
functions become cumbersome, so we derive the minimum-MSE parameterization of the high-complexity
classifier (i.e., the MSE-generated "10-10-10" model) in distribution-dependent form. We use the
probabilistic nature of the feature x, described by (4.28) and (4.29).

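The polynomial discriminant functions of the improper parametric model are described by (4.32). Since
no polynomial discriminant function in (4.32) shares parameters with another, we can minimize the MSE for
each discriminant function independent of the other discriminant functions. The operative equation therefore
becomes

$$
\lim_{n \to \infty} \mathrm{MSE}(S^n \mid g_i(x \mid \theta)) = E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta))\big] =
\int_{-\infty}^{\infty} \left[ (g_i(x \mid \theta) - 1)^2 \cdot P_{W|x}(\omega_i \mid x) + (g_i(x \mid \theta))^2 \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx
\tag{H.4}
$$

If the ith discriminant function has $K_i$ parameters, we derive that many normal equations of the form

$$
\begin{aligned}
\frac{\partial}{\partial \theta^*_{i,k}} E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta^*))\big]
&= \frac{\partial}{\partial \theta^*_{i,k}} \int_{-\infty}^{\infty} \left[ (g_i(x \mid \theta^*) - 1)^2 \cdot P_{W|x}(\omega_i \mid x) + (g_i(x \mid \theta^*))^2 \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx \\
&= \int_{-\infty}^{\infty} \left[ 2\,(g_i(x \mid \theta^*) - 1) \cdot \frac{\partial g_i(x \mid \theta^*)}{\partial \theta^*_{i,k}} \cdot P_{W|x}(\omega_i \mid x) + 2\, g_i(x \mid \theta^*) \cdot \frac{\partial g_i(x \mid \theta^*)}{\partial \theta^*_{i,k}} \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx \\
&= 0
\end{aligned}
\tag{H.5}
$$

(where $\theta^*_{i,k}$ denotes the kth element of the parameter vector $\theta^*$ that minimizes the MSE of the ith
discriminant function) in order to solve for the minimum-MSE parameterization $\theta^*$.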

H.1 Distribution-Independent Expressions for the Parameterization of
Low-Order Polynomial Discriminant Functions

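When the discriminant function in (H.4) is a constant, i.e.,

$$g_i(x \mid \theta) = \theta_{i,0}, \tag{H.6}$$

we denote the value of the parameter that minimizes $E_x[\mathrm{MSE}(x \mid g_i(x \mid \theta))]$ in (H.4) by $\theta^*_{i,0}$. Substituting
(H.6) into (H.5), we obtain the single normal equation:

$$
\begin{aligned}
\frac{\partial}{\partial \theta^*_{i,0}} E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta^*))\big]
&= \frac{\partial}{\partial \theta^*_{i,0}} \int_{-\infty}^{\infty} \left[ (\theta^*_{i,0} - 1)^2 \cdot P_{W|x}(\omega_i \mid x) + (\theta^*_{i,0})^2 \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx \\
&= \int_{-\infty}^{\infty} \left[ 2\,(\theta^*_{i,0} - 1) \cdot P_{W|x}(\omega_i \mid x) + 2\,\theta^*_{i,0} \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx \\
&= 2 \left( \theta^*_{i,0} - P_W(\omega_i) \right) \\
&= 0
\end{aligned}
\tag{H.7}
$$

Thus

$$\theta^*_{i,0} = P_W(\omega_i) \tag{H.8}$$

When the discriminant function in (H.4) is a linear function of x, i.e.,

$$g_i(x \mid \theta) = \theta_{i,1}\, x + \theta_{i,0}, \tag{H.9}$$

the two normal equations are given by

$$
\frac{\partial}{\partial \theta^*_{i,k}} E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta^*))\big] =
\frac{\partial}{\partial \theta^*_{i,k}} \int_{-\infty}^{\infty} \left[ (\theta^*_{i,1} x + \theta^*_{i,0} - 1)^2 \cdot P_{W|x}(\omega_i \mid x) + (\theta^*_{i,1} x + \theta^*_{i,0})^2 \cdot P_{W|x}(\neg\omega_i \mid x) \right] \rho_x(x) \, dx = 0,
\tag{H.10}
$$

(k = 0, 1). Expanding and solving them in the manner leading to (H.7) yields the following MSE-minimizing
parameters:²

$$
\begin{aligned}
\theta^*_{i,1} &= \left[ m_1(x, \omega_i) \cdot P_W(\neg\omega_i) - m_1(x, \neg\omega_i) \cdot P_W(\omega_i) \right] / \zeta_{i,1} \\
\theta^*_{i,0} &= \left[ m_2(x) \cdot P_W(\omega_i) - m_1(x, \omega_i) \cdot m_1(x) \right] / \zeta_{i,1}
\end{aligned}
\tag{H.11}
$$

where

$$
\begin{aligned}
\zeta_{i,1} &= m_2(x) - (m_1(x))^2 \\
m_j(x, \omega_i) &= E_x\big[(x)^j \mid \omega_i\big] \cdot P_W(\omega_i) \\
m_j(x, \neg\omega_i) &= \sum_{\substack{k=1 \\ k \neq i}}^{C} E_x\big[(x)^j \mid \omega_k\big] \cdot P_W(\omega_k) \\
m_j(x) &= E_x\big[(x)^j\big] = \sum_{k=1}^{C} E_x\big[(x)^j \mid \omega_k\big] \cdot P_W(\omega_k)
\end{aligned}
\tag{H.12}
$$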
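The closed-form linear parameters can be checked against the normal equations they are meant to solve. The sketch below is an illustration under assumed values: a hypothetical two-class feature whose class-conditional pdfs are uniform on [0, 2] and [1, 4] with priors 0.4 and 0.6, and hypothetical helper names. For a linear discriminant, the two normal equations reduce to $\theta_1 m_2 + \theta_0 m_1 = m_1(x, \omega_i)$ and $\theta_1 m_1 + \theta_0 = P_W(\omega_i)$.

```python
def uniform_moment(j, lo, hi):
    """j-th raw moment of a uniform random variable on [lo, hi]."""
    return (hi ** (j + 1) - lo ** (j + 1)) / ((j + 1) * (hi - lo))


def linear_min_mse(classes, i):
    """Minimum-MSE parameters (theta1, theta0) of the linear discriminant
    g_i(x) = theta1 * x + theta0 for class i.

    classes: list of (prior, lo, hi) tuples describing uniform
             class-conditional pdfs (an assumption of this sketch).
    """
    prior_i, lo_i, hi_i = classes[i]
    m1_i = uniform_moment(1, lo_i, hi_i) * prior_i            # m_1(x, w_i)
    m1 = sum(p * uniform_moment(1, lo, hi) for p, lo, hi in classes)
    m2 = sum(p * uniform_moment(2, lo, hi) for p, lo, hi in classes)
    m1_not_i = m1 - m1_i                                      # m_1(x, not w_i)
    zeta = m2 - m1 ** 2                                       # variance of x
    theta1 = (m1_i * (1.0 - prior_i) - m1_not_i * prior_i) / zeta
    theta0 = (m2 * prior_i - m1_i * m1) / zeta
    return theta1, theta0, (m1, m2, m1_i, prior_i)


# Hypothetical two-class example.
classes = [(0.4, 0.0, 2.0), (0.6, 1.0, 4.0)]
theta1, theta0, (m1, m2, m1_i, prior_i) = linear_min_mse(classes, 0)

# Residuals of the two normal equations; both should be numerically zero.
residual_slope = theta1 * m2 + theta0 * m1 - m1_i
residual_offset = theta1 * m1 + theta0 - prior_i
print(theta1, theta0, residual_slope, residual_offset)
```

The slope comes out negative for class 0 in this example, as expected for the class that occupies the lower end of the feature's range.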

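When the discriminant function in (H.4) is a quadratic function of x, i.e.,

$$g_i(x \mid \theta) = \theta_{i,2}\,(x)^2 + \theta_{i,1}\, x + \theta_{i,0}, \tag{H.13}$$

the three normal equations are given by

$$
\frac{\partial}{\partial \theta^*_{i,k}} E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta^*))\big] =
\frac{\partial}{\partial \theta^*_{i,k}} \int_{-\infty}^{\infty} \Big[ \big(\theta^*_{i,2}(x)^2 + \theta^*_{i,1} x + \theta^*_{i,0} - 1\big)^2 \cdot P_{W|x}(\omega_i \mid x)
+ \big(\theta^*_{i,2}(x)^2 + \theta^*_{i,1} x + \theta^*_{i,0}\big)^2 \cdot P_{W|x}(\neg\omega_i \mid x) \Big] \rho_x(x) \, dx = 0,
\tag{H.14}
$$

(k = 0, 1, 2). Expanding and solving them in the manner leading to (H.7) yields the following MSE-
minimizing parameters:

$$
\begin{aligned}
\theta^*_{i,2} = \big[\, & P_W(\omega_i) \left[ (m_2(x))^2 - m_1(x) \cdot m_3(x) \right] \\
& + m_2(x, \omega_i) \left[ (m_1(x))^2 - m_2(x) \right] \\
& + m_1(x, \omega_i) \cdot \left[ m_3(x) - m_1(x) \cdot m_2(x) \right] \big] / \zeta_{i,2} \\[1ex]
\theta^*_{i,1} = \big[\, & P_W(\omega_i) \left[ m_1(x) \cdot m_4(x) - m_2(x) \cdot m_3(x) \right] \\
& + m_2(x, \omega_i) \left[ m_3(x) - m_1(x) \cdot m_2(x) \right] \\
& + m_1(x, \omega_i) \left[ (m_2(x))^2 - m_4(x) \right] \big] / \zeta_{i,2} \\[1ex]
\theta^*_{i,0} = \big[\, & P_W(\omega_i) \left[ (m_3(x))^2 - m_2(x) \cdot m_4(x) \right] \\
& + m_2(x, \omega_i) \left[ (m_2(x))^2 - m_1(x) \cdot m_3(x) \right] \\
& + m_1(x, \omega_i) \left[ m_1(x) \cdot m_4(x) - m_2(x) \cdot m_3(x) \right] \big] / \zeta_{i,2}
\end{aligned}
\tag{H.15}
$$

where

$$\zeta_{i,2} = (m_2(x))^3 + (m_3(x))^2 + m_4(x) \cdot \left[ (m_1(x))^2 - m_2(x) \right] - 2\, m_1(x) \cdot m_2(x) \cdot m_3(x) \tag{H.16}$$

The jth moment of a uniformly-distributed random variable with lower and upper bounds of l and u is

$$E_x\big[(x)^j\big] = \frac{u^{j+1} - l^{j+1}}{(j+1)(u - l)} \tag{H.17}$$

Using (H.17), (4.28), and (4.29), Equations (H.8) through (H.16) can be evaluated. The resulting values for the
minimum- and low-complexity polynomial classifiers of the homoscedastic uniformly-distributed random
variable in section 4.28 are given in the top and middle entries of table 4.1 (page 104).

² Equation (H.5) represents a distribution-independent multi-class generalization of a result for the 2-class Gaussian feature in [6].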
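The quadratic expressions can be exercised the same way as the linear ones. In the sketch below the closed-form parameters are computed from the first four moments and then substituted back into the three normal equations, which for a quadratic discriminant read $\theta_0 m_{k} + \theta_1 m_{k+1} + \theta_2 m_{k+2} = m_k(x, \omega_i)$ for k = 0, 1, 2 (with $m_0 = 1$ and $m_0(x, \omega_i) = P_W(\omega_i)$). The two-class uniform mixture and all function names are hypothetical, chosen only to give concrete moment values.

```python
def uniform_moment(j, lo, hi):
    """j-th raw moment of a uniform random variable on [lo, hi]."""
    return (hi ** (j + 1) - lo ** (j + 1)) / ((j + 1) * (hi - lo))


def quadratic_min_mse(prior_i, m1_i, m2_i, m1, m2, m3, m4):
    """Closed-form minimum-MSE parameters (theta2, theta1, theta0) of
    g_i(x) = theta2 * x**2 + theta1 * x + theta0.

    m1_i, m2_i are the class-weighted moments m_j(x, w_i);
    m1..m4 are the unconditional moments of x.
    """
    zeta = m2 ** 3 + m3 ** 2 + m4 * (m1 ** 2 - m2) - 2.0 * m1 * m2 * m3
    theta2 = (prior_i * (m2 ** 2 - m1 * m3)
              + m2_i * (m1 ** 2 - m2)
              + m1_i * (m3 - m1 * m2)) / zeta
    theta1 = (prior_i * (m1 * m4 - m2 * m3)
              + m2_i * (m3 - m1 * m2)
              + m1_i * (m2 ** 2 - m4)) / zeta
    theta0 = (prior_i * (m3 ** 2 - m2 * m4)
              + m2_i * (m2 ** 2 - m1 * m3)
              + m1_i * (m1 * m4 - m2 * m3)) / zeta
    return theta2, theta1, theta0


# Hypothetical mixture: (prior, lower bound, upper bound) per class.
classes = [(0.4, 0.0, 2.0), (0.6, 1.0, 4.0)]
m = [sum(p * uniform_moment(j, lo, hi) for p, lo, hi in classes) for j in range(5)]
prior_i, lo_i, hi_i = classes[0]
m1_i = prior_i * uniform_moment(1, lo_i, hi_i)
m2_i = prior_i * uniform_moment(2, lo_i, hi_i)

theta2, theta1, theta0 = quadratic_min_mse(prior_i, m1_i, m2_i, m[1], m[2], m[3], m[4])

# Residuals of the three normal equations; each should be numerically zero.
targets = [prior_i, m1_i, m2_i]
residuals = [theta0 * m[k] + theta1 * m[k + 1] + theta2 * m[k + 2] - targets[k]
             for k in range(3)]
print(theta2, theta1, theta0, residuals)
```

Checking the residuals rather than quoting parameter values keeps the example independent of the particular distribution used in section 4.3.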

H.2 Distribution-Dependent Expressions for the Parameterization of
Polynomial Discriminant Functions

The preceding distribution-independent, minimum-MSE parameter expressions are cumbersome. Fortunately,
the piece-wise constant nature of the class-conditional pdfs and a posteriori probabilities of x (see figure 4.9)
allows a straightforward expansion of (H.5), by which compact expressions for the minimum-MSE polynomial
discriminant function parameters can be obtained. Equation (H.5) can be re-stated as follows:

$$
\begin{aligned}
\frac{\partial}{\partial \theta^*_{i,k}} E_x\big[\mathrm{MSE}(x \mid g_i(x \mid \theta^*))\big]
&= 2 \int_{-\infty}^{\infty} (g_i(x \mid \theta^*) - 1) \cdot \frac{\partial g_i(x \mid \theta^*)}{\partial \theta^*_{i,k}} \cdot P_W(\omega_i) \cdot \rho_{x|W}(x \mid \omega_i) \, dx \\
&\quad + 2 \sum_{\substack{j=1 \\ j \neq i}}^{C} \int_{-\infty}^{\infty} g_i(x \mid \theta^*) \cdot \frac{\partial g_i(x \mid \theta^*)}{\partial \theta^*_{i,k}} \cdot P_W(\omega_j) \cdot \rho_{x|W}(x \mid \omega_j) \, dx \\
&= 0
\end{aligned}
\tag{H.18}
$$

Using (4.28) and (4.29), we can express the kth normal equation for the three polynomial discriminant
functions thus (dropping the factor of 2):

^E,[MSE(x|g,(xlr))]  = 

•05  f  U,(A:|r)  -  I)  .  ^g,(x\e')dx 

J-i.i  “''I,* 

+  •>  I  ^g\{x\9‘)  ■  ■J^^g^(x\0^)dx 

+  .05^^  g\(x\e')  ■  -~^^g\(x\0*)dx 

=  0  (H.I9) 


^E.(MSE(x|g2{xm)]  = 

•05  f  ^g2{x\9‘)dx 

./-5.8  «v2,ji 

+  •>  I  ^  («2U|»*)  -  I)  • 

+  05/  g2{x\0‘)  ■  •^g2{x\0')dx 

./.1.8  “  “2,* 


0 


(H.20)
dOy


E,[MSE(xlg,(jr|r))]  = 


•05  /  /f.Uir)  -  ^gy{x\e')dx 

./-5  8 

+  .1  I  ^gs{x\0')  ■  ^~gy{x\0')dx 
+  05  («,i(x|e‘)  -  1)  •  j^^sAx\0')dx 


=  0 


(H.21) 


We use the normal equations of (H.19) through (H.21) to solve for the minimum-MSE parameterization of the
high-complexity polynomial classifier. Since there are three 10th-order polynomial discriminant functions,
there are three sets of ten equations with ten unknowns. The resulting 10th-order parameters are listed at the
bottom of table 4.1, page 104; they were computed from the normal equations with Mathematica.³


³ Mathematica is a registered trademark of Wolfram Research, Inc.


Appendix  I 


Monotonic  Fractions  Generated  by 
Three  Error  Measures 


This appendix supports section 5.3; it contains derivations for the monotonic fractions of discriminator
output space generated by the mean absolute error (MAE), mean-squared error (MSE), and Kullback-Leibler
information distance (CE) objective functions. All derivations are performed for the discriminator output
space $\mathcal{Y} = [0,1]^C$, rather than the more general space $\mathcal{Y} = [l,h]^C$. This is done to simplify the notation.
In the case of the MAE and MSE error measures, the derivations for $\mathcal{Y} = [0,1]^C$ yield results that are
identical to those for the more general space, owing to the scalable properties of the MAE and MSE objective
functions. In the case of the CE error measure, the derivations for $\mathcal{Y} = [0,1]^C$ yield results that are
not identical to those for the more general space, owing to the non-scalable properties of the CE objective
function. Unfortunately, the CE derivations for $\mathcal{Y} = [l,h]^C$ are tedious, so we limit ourselves to treatment
of the space $\mathcal{Y} = [0,1]^C$. These specific results are qualitatively representative of those for the more general
space, as long as l and h are finite, a constraint that is consistent with (2.60).

As  in  section  5.3,  we  assume  that  Vt  is  always  yi  in  order  to  simplify  notation  further. 

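I.1 MAE Monotonic Fractions

Assuming $\neg D = l = 0$ and $D = h = 1$, (5.37) and (5.38) ensure that an example of $\omega_1$ is correctly
classified if $\mathrm{MAE} < \frac{1}{C}$, such that

$$y_1 > \sum_{i=2}^{C} y_i \tag{I.1}$$

Thus, we can compute the monotonic correct fraction of discriminator output space $\mathrm{MAE}\,\mathcal{CF}_{mono}(C)$ by
using (I.1) to set the limits of integration for the C-tuple cardinality (i.e., volume) integral thus:

$$
\begin{aligned}
\mathrm{MAE}\,\mathcal{CF}_{mono}(C)
&= \underbrace{\int_0^1 \int_0^{y_1} \int_0^{y_1 - y_2} \cdots \int_0^{y_1 - \sum_{j=2}^{C-1} y_j}}_{C \text{ integral terms}} dy_C \cdots dy_2 \, dy_1 \\
&= \underbrace{\int_0^1 \int_0^{y_1} \cdots \int_0^{y_1 - \sum_{j=2}^{C-2} y_j}}_{C-1 \text{ integral terms}} \left[ y_1 - \sum_{j=2}^{C-1} y_j \right] dy_{C-1} \cdots dy_1 \\
&= \int_0^1 \int_0^{y_1} \cdots \int_0^{y_1 - \sum_{j=2}^{C-3} y_j} \frac{1}{2} \left[ y_1 - \sum_{j=2}^{C-2} y_j \right]^2 dy_{C-2} \cdots dy_1 \\
&= \int_0^1 \int_0^{y_1} \cdots \int_0^{y_1 - \sum_{j=2}^{C-4} y_j} \frac{1}{2 \cdot 3} \left[ y_1 - \sum_{j=2}^{C-3} y_j \right]^3 dy_{C-3} \cdots dy_1 \\
&\;\;\vdots \\
&= \int_0^1 \frac{(y_1)^{C-1}}{(C-1)!} \, dy_1 \\
&= \frac{1}{C!} = \frac{1}{\Gamma(C+1)}
\end{aligned}
\tag{I.2}
$$

Assuming $\neg D = l = 0$ and $D = h = 1$, (5.33)-(5.36) ensure that an example of $\omega_1$ is incorrectly
classified if $\mathrm{MAE} > \frac{C-1}{C}$:

$$1 - y_1 + \sum_{j=2}^{C} y_j > C - 1 \tag{I.3}$$

Equivalently,

$$y_1 < \sum_{j=2}^{C} y_j - (C - 2) \tag{I.4}$$

If we let $y'_i = 1 - y_i$, (I.4) can be expressed as

$$y'_1 > \sum_{i=2}^{C} y'_i \tag{I.5}$$

Thus, we can compute the monotonic incorrect fraction of discriminator output space $\mathrm{MAE}\,\mathcal{IF}_{mono}(C)$ by
using (I.5) to set the limits of integration for the C-tuple cardinality (i.e., volume) integral in precisely the
same manner we use (I.1) to set the limits of (I.2):

$$
\mathrm{MAE}\,\mathcal{IF}_{mono}(C)
= \int_0^1 \int_0^{y'_1} \cdots \int_0^{y'_1 - \sum_{j=2}^{C-1} y'_j} dy'_C \cdots dy'_2 \, dy'_1
= \frac{1}{C!} = \frac{1}{\Gamma(C+1)}
\tag{I.6}
$$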

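Recall from (5.30)

$$\mathcal{MF} = \mathcal{IF}_{mono} + \mathcal{CF}_{mono} \tag{I.7}$$

so, by (I.2), (I.6), and (I.7),

$$
\begin{aligned}
\mathrm{MAE}\,\mathcal{CF}_{mono}(C) &= \frac{1}{C!} = \frac{1}{\Gamma(C+1)} \\
\mathrm{MAE}\,\mathcal{IF}_{mono}(C) &= \frac{1}{C!} = \frac{1}{\Gamma(C+1)} \\
\therefore\; \mathrm{MAE}\,\mathcal{MF}(C) &= \frac{2}{C!} = \frac{2}{\Gamma(C+1)}
\end{aligned}
\tag{I.8}
$$

Thus, (5.40)-(5.42) are derived.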
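The value 1/C! can be sanity-checked by sampling. The sketch below draws points uniformly from the unit hypercube and counts how often the first coordinate exceeds the sum of the others (the correct region), and how often the same holds for the complemented coordinates (the incorrect region). The function names, seeds, and trial count are arbitrary choices made for this illustration.

```python
import math
import random


def mae_correct_fraction_mc(num_classes, trials, seed=0):
    """Monte Carlo estimate of the volume of {y in [0,1]^C : y_1 > sum of the rest}."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = [rng.random() for _ in range(num_classes)]
        if y[0] > sum(y[1:]):
            hits += 1
    return hits / trials


def mae_incorrect_fraction_mc(num_classes, trials, seed=1):
    """Same estimate for the incorrect region, using y' = 1 - y."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y_prime = [1.0 - rng.random() for _ in range(num_classes)]
        if y_prime[0] > sum(y_prime[1:]):
            hits += 1
    return hits / trials


C = 3
correct = mae_correct_fraction_mc(C, 200000)
incorrect = mae_incorrect_fraction_mc(C, 200000)
closed_form = 1.0 / math.factorial(C)
print(correct, incorrect, closed_form, 2.0 * closed_form)
```

Both estimates should sit close to 1/6 for C = 3, and their sum close to the monotonic fraction 2/C!.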


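I.2 MSE Monotonic Fractions

Assuming $\neg D = l = 0$ and $D = h = 1$, (5.52) and (5.53) ensure that an example of $\omega_1$ is correctly
classified if $\mathrm{MSE} < \frac{1}{2C}$; one point in $\mathcal{Y}$ generating this value of MSE occurs at

$$y_1 = y_2 = \tfrac{1}{2}; \qquad y_j = 0 \;\; \forall\, j > 2 \tag{I.9}$$

Thus, the lower bound on $y_1$ in (5.54) necessary to ensure that an example of $\omega_1$ is correctly classified
reduces to

$$y_1 > 1 - \sqrt{\frac{1}{2} - \sum_{j=2}^{C} (y_j)^2} \tag{I.10}$$

It can be shown that (I.10) yields an expression for $\mathrm{MSE}\,\mathcal{CF}_{mono}(C)$ that is equal to $2^{-C}$ times the volume
of the C-dimensional sphere with radius $\frac{1}{\sqrt{2}}$, centered at the point $Y_{correct}$ (recall from (5.3) that $Y_{correct}$
is, in the case of an example of $\omega_1$, the point at which $y_1 = 1$ and $y_j = 0 \;\forall\, j \neq 1$).

Given the volume of the C-dimensional sphere with radius $\frac{1}{\sqrt{2}}$ [4, pg. 411],

$$\mathrm{MSE}\,\mathcal{CF}_{mono}(C) = 2^{-C} \cdot \frac{\pi^{C/2} \left( \frac{1}{\sqrt{2}} \right)^{C}}{\Gamma\left(\frac{C}{2} + 1\right)} = \frac{\left(\frac{\pi}{8}\right)^{C/2}}{\Gamma\left(\frac{C}{2} + 1\right)} \tag{I.11}$$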
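The sphere-volume expression can be compared with a direct sampling estimate. The sketch below rests on an assumption that should be stated plainly: it takes the correct region to be the part of the unit hypercube lying within Euclidean distance $1/\sqrt{2}$ of the point (1, 0, ..., 0), which gives the closed form $(\pi/8)^{C/2} / \Gamma(C/2 + 1)$. The function names, seed, and trial count are hypothetical.

```python
import math
import random


def mse_correct_fraction(num_classes):
    """Closed form: 2**-C times the volume of a C-sphere of radius 1/sqrt(2)."""
    c = num_classes
    return (math.pi / 8.0) ** (c / 2.0) / math.gamma(c / 2.0 + 1.0)


def mse_correct_fraction_mc(num_classes, trials, seed=0):
    """Monte Carlo estimate of the same volume inside the unit hypercube."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        y = [rng.random() for _ in range(num_classes)]
        squared_distance = (1.0 - y[0]) ** 2 + sum(v * v for v in y[1:])
        if squared_distance < 0.5:
            hits += 1
    return hits / trials


for c in (2, 3):
    print(c, mse_correct_fraction(c), mse_correct_fraction_mc(c, 100000))
```

For C = 2 the region is a quarter disc of area $\pi/8$, which gives a quick hand check on both the formula and the sampler.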


Assuming  -iD  =s  /  =  0  and  D  =  h  =  \ ,  (5.47)  —  (5.50)  ensure  that  an  example  of  Ct^i  is  incorrectly 
classified  if  MSE  >  ; 


$$(1 - y_1)^2 + \sum_{j=2}^{C} y_j^2 \;>\; C - 1 \tag{I.12}$$

Equivalently,

$$y_1 \;<\; 1 - \sqrt{C - 1 - \sum_{j=2}^{C} y_j^2} \tag{I.13}$$


Equation (I.13) leads to a relatively complicated C-tuple cardinality (i.e., volume) integral for $\mathrm{MSE}\,\mathcal{IF}_{\mathrm{mono}}(C)$. We spare the effort of evaluating the integral explicitly by bounding it from above as follows. Compare the condition for a surely MSE-misclassified example in (I.12) with that for a surely MAE-misclassified example in (I.3). Equation (I.12) defines the inner boundary (as measured from $\mathbf{Y}_{\mathrm{correct}}$) of the monotonic region of incorrect space, given the MSE objective function. Likewise, (I.3) defines the inner boundary (as measured from $\mathbf{Y}_{\mathrm{correct}}$) of the monotonic region of incorrect space, given the MAE objective function. In fact, the set of discriminator output states that satisfy (I.12) describes a convex hypersurface. Each point on this hypersurface has a Euclidean distance from $\mathbf{Y}_{\mathrm{correct}}$ that is at least as great as that of any point on the hyperplane described by (I.3). As a result, the monotonic region of incorrect space generated by the MSE objective function is always fully enclosed by the monotonic region of incorrect space generated by the MAE objective function. It follows immediately that the monotonic incorrect fraction of discriminator output space generated by MSE is bounded from above by its MAE-generated counterpart:

$$\begin{aligned}
\mathrm{MSE}\,\mathcal{IF}_{\mathrm{mono}}(C) &< \mathrm{MAE}\,\mathcal{IF}_{\mathrm{mono}}(C) \\
\mathrm{MSE}\,\mathcal{IF}_{\mathrm{mono}}(C) &< \frac{1}{\Gamma(C + 1)}
\end{aligned} \tag{I.14}$$

Since the upper bound of (I.14) decreases super-exponentially with $C$, so does $\mathrm{MSE}\,\mathcal{IF}_{\mathrm{mono}}(C)$. By (I.7), (I.11), and (I.14),


MSEJ:r„,„„„(t’)  < 

MSEC’jr„,„m.(t’)  = 


r(t’  +  1) 

(f)^ 

r(S  +  I) 

(S)* 


<  nlTi)  ^  TicTT)  <  f(fTl) 

Thus,  (5.55)  —  (5.57)  are  derived. 

I.3 Kullback-Leibler Monotonic Fractions


t’  >  2 


(1.15) 


Assuming $\neg D = l = 0$ and $D = h = 1$, the CE expression of (5.62) reduces to

$$\mathrm{CE} = -\log(y_1) - \sum_{j=2}^{C} \log(1 - y_j) \tag{I.16}$$

The  minimum  value  of  CE  generated  by  an  incorrectly  classified  example  of  Ui  is,  by  (5.68),  —  log(A) , 
where  A  =  ^ .  Thus,  the  monotonic  correct  fraction  of  di.scriminator  output  space  is  described  by  the  set  of 
points  in  y  satisfying  the  following  equation: 


logO’i)  +  5Z  “  yi'^ 
/=2 

Equivalently, 


y,  •  n  (•  -  •'’>)  < 

y=2 


(1.18) 
Thus, we can compute the monotonic correct fraction of discriminator output space $\mathrm{CE}\,\mathcal{CF}_{\mathrm{mono}}(C)$ by using (I.18) to set the limits of integration for the C-tuple cardinality (i.e., volume) integral thus:


CE  mono  — 

/**  “  ^/y\  r*  “  ^  I  n  ~  >2)] 


nl  -  A/V| 

■lo 


I 


-  A  lv,  .n;lV 


dye  ■■■  dy2  (hi  (1.19) 


C  —  1  integral  terms 


Using the short-hand notation

$$B_{C-k} \triangleq A \cdot \left[\, y_1 \cdot \prod_{j=2}^{C-k} (1 - y_j) \right]^{-1} \tag{I.20}$$

we restate (I.19) thus:


I  -  Bc-i 


I  —  Be -I  dyc-\  dyi 


I  -  Be -2 


I  -  .Vo- 1 


Be -2  dye -I  ■■■  dyi 


I  -  Be_i 


1  -  Be. 


yo-i  +  Be-2  •  In  (1  -  Vf.i) 


I  -  Be-: 


CECJ^^iO  = 

j:-l 

-  i:-L 

-  i:-L 


dye -2  dyi 


1  —  Be~2  +  Be-2  •  In  (Bt*_2)  dye -2  ••• 


•'  dv 


'V| 


=  i-a-E 


j=0 


MV 

yi 


y  =  (/  =  O.ft  -  1]^  A  =  1 


Thus,  (5.70)  is  derived. 

Assuming the more general $\neg D = l$ and $D = h$, the CE expression of (5.62) is


(1.2 1 ) 
$$\mathrm{CE} = -\log(y_1 - l) - \sum_{j=2}^{C} \log(h - y_j) \tag{I.22}$$

The  minimum  value  of  CE  generated  by  an  incorrectly  classified  example  of  Ui  is,  by  (5.68), 
-C  •  logfh  —  /)  -  log(A),  where  A  =  j.  Thus,  the  monotonic  correct  fraction  of  discriminator 
output  space  is  described  by  the  set  of  points  in  y  satisfying  the  following  equation: 

c 

log(^vi  -  1)  +  Y^\Qg{h  -  y^)  <  C  •  log(/i  -  I)  +  log(A)  (1.23) 

j=2 

Equivalently, 


(VI  -  /)  •  n  ~  -  ‘f  (f-24) 

i=2 

Thus,  the  more  general  version  of  (1.2 1 )  is  given  by 


CEC:F„«„(C)  = 

I 


(/I  -  /) 


I.  -  [X  (A  -  /!<■]  [iv,  -  /)  '*  -  '>)]  ■' 


dye  •  •  •  dy2  dy, ,  (1.25) 


C  “  I  intcpal  lenns 


where 


C  =  A  •  (/I  +  3/)  (1.26) 

Evaluating (I.25) becomes a non-trivial exercise in bookkeeping, which we omit for the sake of brevity. As mentioned previously, the derivations for $\mathcal{Y} = [0,1]^C$ yield results that are qualitatively representative of those for the more general space, as long as $l$ and $h$ are finite, a constraint that is consistent with (2.60).


Appendix  J 

Tabulated  Die  Casting  Bounds 


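Section 6.4.1 derives a greatest lower bound ${\sim}n_\Delta$ on the number of die casts necessary for the most likely face $\omega_{(1)}$ of an unfair die to become empirically evident with probability at least $1 - \alpha = .95$:

$$n_\Delta\left[P(\omega_{(1)}), P(\omega_{(2)}), \alpha = .05\right] \;\geq\; \epsilon \cdot \frac{P(\omega_{(1)}) \cdot \left(1 - P(\omega_{(1)})\right) + P(\omega_{(2)}) \cdot \left(1 - P(\omega_{(2)})\right)}{\left(P(\omega_{(1)}) - P(\omega_{(2)})\right)^2} \tag{J.1}$$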


Recall that $P(\omega_{(i)})$ is short-hand notation for $P_{\mathcal{W}|\mathbf{x}}(\omega_{(i)} \mid \mathbf{x})$. Through the Monte Carlo simulations tabulated below, we have found that the most likely die face remains empirically evident with empirical probability not less than .95 if the multiplicative constant $\epsilon$ in (J.1) is reduced from the Chebyshev-imposed value of 20 to 9.
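As a rough illustration of how the tabulated bound values arise, the sketch below evaluates the right-hand side of (J.1) for a given pair of top-ranked face probabilities. The function name `casts_bound` and the parameter name `scale` (standing for the constant that is 20 under Chebyshev and 9 in the tables) are our own labels, and the reading of (J.1) as a constant times a variance-to-squared-difference ratio is an assumption drawn from the tabulated figures.

```python
import math


def casts_bound(p1: float, p2: float, scale: float = 9.0) -> float:
    """Right-hand side of (J.1) for top-ranked face probabilities p1 > p2.

    `scale` is the multiplicative constant: 20 is the Chebyshev-imposed
    value for alpha = .05; 9 is the reduced value used for the tables.
    """
    if not 0.0 <= p2 < p1 <= 1.0:
        raise ValueError("require 0 <= p2 < p1 <= 1")
    variance_terms = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return scale * variance_terms / (p1 - p2) ** 2


if __name__ == "__main__":
    # Compare with the tabulated bounds for (0.2, 0.02) and (0.1, 0.02).
    print(math.ceil(casts_bound(0.2, 0.02)))
    print(math.ceil(casts_bound(0.1, 0.02)))
    # The Chebyshev-imposed constant gives a proportionally larger bound.
    print(math.ceil(casts_bound(0.2, 0.02, scale=20.0)))
```

Rounding the result up to the next integer matches the tabulated bounds for these two rows (50 and 155); rows where the ratio falls very close to an integer can be sensitive to floating-point rounding.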

The following table compares ${\sim}n_\Delta$ ($\epsilon = 9$ in (J.1) above) with empirical estimates of $n_\Delta$, which we denote by $\hat{n}_\Delta$. The empirical estimates were obtained by simulating 1,000 independent die casting sequences, each having up to 10,000 casts, for each tabulated value of $P(\omega_{(1)})$ and $P(\omega_{(2)})$. The value of $n$ above which $\omega_{(1)}$ became empirically evident (i.e., above which the empirical estimate of $P(\omega_{(1)})$ remained maximal for all subsequent casts of the die) was recorded for each trial. The values of $n$ for all 1,000 trials were sorted, and $\hat{n}_\Delta$ was taken to be the value of $n$ at the 950th position from the bottom of the sorted list (i.e., the 95th percentile of the 1,000 trials).
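The simulation protocol just described can be sketched as follows. This is a hedged reconstruction rather than the authors' code: the function names are hypothetical, ties between face counts are treated as "not yet evident" (an assumption), and face 0 plays the role of the most likely face.

```python
import random
from typing import Sequence


def last_cast_not_evident(probs: Sequence[float], max_casts: int,
                          rng: random.Random) -> int:
    """Return the last cast n at which face 0 was not the unique empirical leader.

    For every cast after the returned n (up to max_casts), face 0 has a
    strictly larger count than any other face.
    """
    counts = [0] * len(probs)
    best_other = 0          # largest count among faces other than face 0
    last_failure = 0
    faces = rng.choices(range(len(probs)), weights=probs, k=max_casts)
    for n, face in enumerate(faces, start=1):
        counts[face] += 1
        if face != 0 and counts[face] > best_other:
            best_other = counts[face]
        if counts[0] <= best_other:
            last_failure = n
    return last_failure


def empirical_casts(probs: Sequence[float], trials: int = 1000,
                    max_casts: int = 10_000, quantile: float = 0.95,
                    seed: int = 0) -> int:
    """Sort the per-trial values of n and return the one at the quantile position."""
    rng = random.Random(seed)
    values = sorted(last_cast_not_evident(probs, max_casts, rng)
                    for _ in range(trials))
    position = max(1, int(round(quantile * trials)))  # 950th of 1,000 trials
    return values[position - 1]


if __name__ == "__main__":
    # A small, fast configuration; the text uses 1,000 trials of 10,000 casts.
    print(empirical_casts([0.6, 0.2, 0.2], trials=200, max_casts=2000))
```

With the defaults of 1,000 trials and 10,000 casts the loop performs ten million draws, so a pure-Python run takes a while; the smaller configuration in the usage example is only meant to show the mechanics.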

The number of faces $C_{\min}$ on the die for each set of 1,000 trials was chosen to be the smallest number for which the lesser ranked face probabilities did not exceed $P(\omega_{(2)})$:

$$\text{choose } C_{\min} \ \text{s.t.}\ P(\omega_{(j)}) \leq P(\omega_{(2)}) \quad \forall\, j > 2 \tag{J.2}$$

The lesser ranked probabilities $P(\omega_{(3)}), \ldots, P(\omega_{(C-1)})$ were set to the value of $P(\omega_{(2)})$; the remaining probability $P(\omega_{(C)})$ was set to the value $1 - \sum_{j=1}^{C-1} P(\omega_{(j)})$. This choice of $C$ and the lesser ranked face probabilities is approximately worst-case, in that it ensures that the lesser ranked face probabilities are as large as possible. This, in turn, tends to maximize the number of die casts necessary for the most likely die face to become empirically evident, given a particular choice of top ranked face probabilities¹ $\left(P(\omega_{(1)}), P(\omega_{(2)})\right)$.
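The construction of the approximately worst-case die can be sketched in a few lines. The helper below is hypothetical; it uses exact rational arithmetic (probabilities passed as decimal strings) so that the ceiling in the face count is not disturbed by floating-point error.

```python
import math
from fractions import Fraction
from typing import List


def worst_case_die(p1: str, p2: str) -> List[Fraction]:
    """Build face probabilities following (J.2) and the accompanying text.

    Faces 3..C-1 are set to p2 and the last face takes the remaining mass,
    so C is the smallest face count for which no lesser ranked face
    exceeds p2.
    """
    first, second = Fraction(p1), Fraction(p2)
    if not 0 < second < first or first + second > 1:
        raise ValueError("require 0 < p2 < p1 and p1 + p2 <= 1")
    remaining = 1 - first - second
    extra_faces = math.ceil(remaining / second)  # faces beyond the top two
    probs = [first, second]
    if extra_faces > 0:
        probs += [second] * (extra_faces - 1)
        probs.append(remaining - (extra_faces - 1) * second)
    return probs


if __name__ == "__main__":
    die = worst_case_die("0.1", "0.02")
    print(len(die), float(die[-1]), sum(die))
```

For the top-ranked pair (0.1, 0.02) this yields a 46-faced die, and for (0.2, 0.18) a 6-faced die, in agreement with the $C_{\min}$ column of the table.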

The values of $\hat{n}_\Delta$ listed below are not rounded up for a posteriori class differential values greater than .2 (i.e., for $\Delta_{\mathcal{W}|\mathbf{x}}(\omega_{(1)} \mid \mathbf{x}) = P(\omega_{(1)}) - P(\omega_{(2)}) > .2$); they are rounded up to the nearest value divisible by 5 for a posteriori class differential values greater than .1; they are rounded up to the nearest value divisible by 10 for a posteriori class differential values not greater than .1. The bound ${\sim}n_\Delta$ is a relatively tight one on $\hat{n}_\Delta$, not over-estimating $\hat{n}_\Delta$ by more than about 100%. Likewise, ${\sim}n_\Delta$ under-estimates $\hat{n}_\Delta$ only for values of $\hat{n}_\Delta \lesssim 10$.


Empirical = empirical number of casts n̂_Δ[P(ω(1)), P(ω(2)), α = .05]; Bound = ~n_Δ[P(ω(1)), P(ω(2)), α = .05]. A "?" marks a cell that is illegible in the source.

P(ω(1))  P(ω(2))  C_min  Empirical  Bound
0.1      0.02     46     140        155
0.1      0.04     24     260        321
0.1      0.06     16     640        824
0.1      0.08     13     2800       3681
0.12     0.02     45     100        113
0.12     0.04     23     170        203
0.12     0.06     16     320        405
0.12     0.08     12     730        1008
0.12     0.1      ?      3190       4401
0.14     0.02     44     75         88
0.14     0.04     23     130        143
0.14     0.06     ?      190        249
0.14     0.08     12     370        485
0.14     0.1      ?      860        1184
0.14     0.12     9      3460       5085
0.16     0.02     43     60         71
0.16     0.04     22     90         109
0.16     0.06     15     140        172
0.16     0.08     12     230        293
0.16     0.1      ?      ?          562
0.16     0.12     8      990        1351
0.16     0.14     7      4050       5734
0.18     0.02     42     50         59
0.18     0.04     22     75         86
0.18     0.06     15     100        128
0.18     0.08     12     160        200


¹ A rigorous proof of this assertion is beyond our interest and stamina; we choose this protocol as a reasonable means of approximating the number and values of the lesser-ranked face probabilities that would result in the largest number of die casts required for the most likely face to become empirically evident, given the probabilities for the two most likely die faces.
0.18     0.1      10     250        335
0.18     0.12     8      440        634
0.18     0.14     7      1050       1508
0.18     0.16     7      4420       6346
0.2      0.02     41     45         50
0.2      0.04     21     55         70
0.2      0.06     15     80         100
0.2      0.08     11     120        147
0.2      0.1      9      160        226
0.2      0.12     8      270        374
0.2      0.14     7      520        702
0.2      0.16     6      1160       1657
0.2      0.18     6      4250       6922
0.22     0.02     40     40         44
0.22     0.04     21     50         59
0.22     0.06     15     65         81
0.22     0.08     11     90         113
0.22     0.1      9      110        164
0.22     0.12     8      180        250
0.22     0.14     7      290        411
0.22     0.16     6      530        766
0.22     0.18     6      1100       1796
0.22     0.2      5      4890       7462
0.24     0.02     39     33         38
0.24     0.04     20     45         50
0.24     0.06     14     55         67
0.24     0.08     11     70         91
0.24     0.1      9      90         126
0.24     0.12     8      135        181
0.24     0.14     7      180        273
0.24     0.16     6      290        446
0.24     0.18     6      570        826
0.24     0.2      5      1120       1927
0.24     0.22     5      4820       7966
0.26     0.02     38     29         34
0.26     0.04     20     34         43
0.26     0.06     14     50         56
0.26     0.08     11     60         74
0.26     0.1      9      75         100
0.26     0.12     8      105        137
0.26     0.14     7      135        196
0.26     0.16     6      210        295
0.26     0.18     6      350        479
0.26     0.2      5      560        882
0.26     0.22     5      1290       2048
0.26     0.24     5      5540       8434
0.28     0.02     37     25         30
0.28     0.04     19     31         38
0.28     0.06     14     39         48
0.28     0.08     10     55         62
0.28     0.1      9      60         82
0.28     0.12     7      80         109
0.28     0.14     7      105        148
0.28     0.16     6      145        211
0.28     0.18     6      230        315
0.28     0.2      5      310        509
0.28     0.22     5      630        933
0.28     0.24     5      1320       2160
0.28     0.26     4      5260       8865
0.3      0.02     36     22         27
0.3      0.04     19     26         34
0.3      0.06     13     34         42
0.3      0.08     10     38         53
0.3      0.1      8      55         68
0.3      0.12     7      65         88
0.3      0.14     6      80         117
0.3      0.16     6      110        159
0.3      0.18     5      140        224
0.3      0.2      5      210        333
0.3      0.22     5      360        537
0.3      0.24     4      570        981
0.3      0.26     4      1350       2264
0.3      0.28     4      5870       9261
0.32     0.02     35     20         24
0.32     0.04     18     26         30
0.32     0.06     13     31         37
0.32     0.08     10     37         46
0.32     0.1      8      47         58
0.32     0.12     7      60         73
0.32     0.14     6      75         94
0.32     0.16     6      80         124
0.32     0.18     5      115        168
0.32     0.2      5      170        236
0.32     0.22     5      250        351
0.32     0.24     4      290        563
0.32     0.26     4      660        1025
0.32     0.28     4      1510       2358
0.32     0.3      4      5770       9621
0.34     0.02     34     19         22
0.34     0.04     18     23         27
0.34     0.06     12     30         33
0.34     0.08     10     33         40
0.34     0.1      8      38         50
0.34     0.12     7      45         62
0.34     0.14     6      65         78
0.34     0.16     6      75         100
0.34     0.18     5      90         131
0.34     0.2      5      115        177
0.34     0.22     5      170        248
0.34     0.24     4      240        367
0.34     0.26     4      340        587
0.34     0.28     4      680        1065
0.34     0.3      4      1600       2444
0.34     0.32     4      6360       9945
0.36     0.02     33     16         20
0.36     0.04     17     22         24
0.36     0.06     12     25         29
0.36     0.08     9      30         35
0.36     0.1      8      31         43
0.36     0.12     7      41         53
0.36     0.14     6      47         66
0.36     0.16     5      63         83
0.36     0.18     5      80         105
0.36     0.2      5      95         138
0.36     0.22     4      110        185
0.36     0.24     4      165        258
0.36     0.26     4      250        381
0.36     0.28     4      390        608
0.36     0.3      4      680        1101
0.36     0.32     3      1530       2520
0.36     0.34     3      5360       10233
0.38     0.02     32     16         18
0.38     0.04     17     19         22
0.38     0.06     12     24         26
0.38     0.08     9      25         31
0.38     0.1      8      29         38
0.38     0.12     7      37         46
0.38     0.14     6      42         56
0.38     0.16     5      50         69
0.38     0.18     5      62         87
0.38     0.2      5      85         110
0.38     0.22     4      105        144
0.38     0.24     4      120        192
0.38     0.26     4      175        268
0.38     0.28     4      245        394
0.38     0.3      4      390        627
0.38     0.32     3      570        1133
0.38     0.34     3      1370       2588
0.38     0.36     3      4950       10485
0.4      0.02     31     14         17
0.4      0.04     16     17         20
0.4      0.06     11     20         24
0.4      0.08     9      23         28
0.4      0.1      7      25         33
0.4      0.12     6      33         40
0.4      0.14     6      37         48
0.4      0.16     5      46         59
0.4      0.18     5      53         73
0.4      0.2      4      71         90
0.4      0.22     4      75         115
0.4      0.24     4      95         149
0.4      0.26     4      110        199
0.4      0.28     4      185        276
0.4      0.3      3      245        405
0.4      0.32     3      340        644
0.4      0.34     3      540        1161
0.4      0.36     3      1340       2646
0.4      0.38     3      5790       10701
0.42     0.02     30     14         15
0.42     0.04     16     15         18
0.42     0.06     11     17         21
0.42     0.08     9      21         25
0.42     0.1      7      23         30
0.42     0.12     6      28         35
0.42     0.14     6      36         42
0.42     0.16     5      36         51
0.42     0.18     5      43         62
0.42     0.2      4      53         76
0.42     0.22     4      61         94
0.42     0.24     4      80         119
0.42     0.26     4      95         154
0.42     0.28     4      140        205
0.42     0.3      3      160        284
0.42     0.32     3      200        416
0.42     0.34     3      330        659
0.42     0.36     3      550        1185
0.42     0.38     3      1590       2696
0.42     0.4      3      5630       10881
0.44     0.02     29     12         14
0.44     0.04     15     15         17
0.44     0.06     11     17         19
0.44     0.08     8      19         23
0.44     0.1      7      20         27
0.44     0.12     6      26         31
0.44     0.14     5      29         37
0.44     0.16     5      33         44
0.44     0.18     5      44         53
0.44     0.2      4      46         64
0.52     0.3      3      51         86
0.52     0.32     3      70         106
0.52     0.34     3      75         132
0.52     0.36     3      105        169
0.52     0.38     3      145        223
0.52     0.4      3      200        306
0.52     0.42     3      315        444
0.52     0.44     3      500        698
0.52     0.46     3      810        1245
0.54     0.02     24     9          9
0.54     0.04     13     10         11
0.54     0.06     9      11         12
0.54     0.08     7      13         14
0.54     0.1      6      15         16
0.54     0.12     5      16         19
0.54     0.14     5      18         21
0.54     0.16     4      21         24
0.54     0.18     4      21         28
0.54     0.2      4      25         32
0.54     0.22     4      25         37
0.54     0.24     3      31         44
0.54     0.26     3      29         51
0.54     0.28     3      37         60
0.54     0.3      3      49         72
0.54     0.32     3      53         87
0.54     0.34     3      73         107
0.54     0.36     3      90         133
0.54     0.38     3      105        171
0.54     0.4      3      140        225
0.54     0.42     3      175        308
0.54     0.44     3      285        446
0.56     0.02     23     9          9
0.56     0.04     12     10         10
0.56     0.06     9      11         11
0.56     0.08     7      11         13
0.56     0.1      6      15         15
0.56     0.12     5      15         17
0.56     0.14     5      15         19
0.56     0.16     4      17         22
0.56     0.18     4      20         25
0.56     0.2      4      21         29
0.56     0.22     3      25         33
0.56     0.24     3      26         38
0.56     0.26     3      31         44
0.56     0.28     3      30         52
0.56     0.3      3      37         61
0.56     0.32     3      47         73
0.62     0.08     6      10         10
0.62     0.1      5      11         11
0.62     0.12     5      10         13
0.62     0.14     4      13         14
0.62     0.16     4      15         16
0.62     0.18     4      13         18
0.62     0.2      3      16         21
0.62     0.22     3      16         23
0.62     0.24     3      18         27
0.62     0.26     3      21         30
0.62     0.28     3      25         35
0.62     0.3      3      27         40
0.62     0.32     3      33         46
0.62     0.34     3      36         53
0.62     0.36     3      44         63
0.64     0.02     19     7          6
0.64     0.04     10     8          7
0.64     0.06     7      9          8
0.64     0.08     6      8          9
0.64     0.1      5      10         10
0.64     0.12     4      12         12
0.64     0.14     4      11         13
0.64     0.16     4      14         15
0.64     0.18     3      13         17
0.64     0.2      3      13         19
0.64     0.22     3      15         21
0.64     0.24     3      16         24
0.64     0.26     3      19         27
0.64     0.28     3      19         30
0.64     0.3      3      27         35
0.64     0.32     3      31         40
0.64     0.34     3      35         46
0.66     0.02     18     6          6
0.66     0.04     10     8          7
0.66     0.06     7      8          8
0.66     0.08     6      8          8
0.66     0.1      5      9          10
0.66     0.12     4      10         11
0.66     0.14     4      10         12
0.66     0.16     4      11         13
0.66     0.18     3      13         15
0.66     0.2      3      14         17
0.66     0.22     3      14         19
0.66     0.24     3      17         21
0.66     0.26     3      16         24
0.66     0.28     3      18         27
0.66     0.3      3      22         31
Appendix K

A Modified Radial Basis Function Classifier


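The Gaussian Radial Basis Function (RBF) neural network architecture (e.g., [18, 95, 104, 92]) employs the following non-linear input-to-output mapping, where $\mathbf{x}$ denotes the input to the RBF unit (or node), and $f(\mathbf{x})$ denotes the node's output:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{m/2}\,|\Sigma|^{1/2}} \cdot \exp\left[-\frac{1}{2}\,(\mathbf{x} - \boldsymbol{\mu})^{T}\,\Sigma^{-1}\,(\mathbf{x} - \boldsymbol{\mu})\right] \tag{K.1}$$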


The notation $\mathbf{z}^{T}$ denotes the transpose of vector $\mathbf{z}$, and $m$ denotes the dimensionality of the input vector $\mathbf{x}$. The modified radial basis function node is identical to the standard RBF node with two exceptions.


•  The covariance matrix $\Sigma$ associated with each RBF node is diagonal, and all of its diagonal elements have the same value (i.e., the covariance matrix has orthonormal eigenvectors and all of its eigenvalues are identical). The matrix is described by the equation

$$\Sigma = \sigma^2 \mathbf{I}, \tag{K.2}$$

where $\mathbf{I}$ denotes the identity matrix and $\sigma^2$ denotes the node's single variance parameter. For this reason, the modified RBF node has $m^2 - 1$ fewer parameters than its standard counterpart. In the case of an 11-element input vector, the modified RBF node has 12 parameters compared to the standard RBF node's 132. Equation (K.2) constrains the RBF node to have hyperspherical (rather than hyperellipsoidal) contours of constant value on the domain of $\mathbf{x}$.

•  The standard RBF's normalizing term $\left[(2\pi)^{m/2}\,|\Sigma|^{1/2}\right]^{-1}$ in (K.1) is eliminated. This has the effect of bounding the modified RBF's output $f(\mathbf{x})$ on the interval [0,1], much like the logistic non-linearity. The standard RBF node's output, by contrast, is bounded on $[0, \infty]$ and has unit area over the domain of $\mathbf{x}$.

Thus, the non-linear input-to-output mapping for the modified RBF unit is given by

$$f(\mathbf{x}) = \exp\left[-\frac{1}{2\sigma^2}\,(\mathbf{x} - \boldsymbol{\mu})^{T}\,\mathbf{I}\,(\mathbf{x} - \boldsymbol{\mu})\right], \tag{K.3}$$

where $\mathbf{I}$ denotes the identity matrix and $\sigma^2$ denotes the node's single variance parameter. By having $m^2 - 1$ fewer parameters than its standard counterpart, the modified RBF node has significantly lower functional complexity. Despite their reduced complexity, networks of modified RBF nodes are still capable of forming a classifier with complex non-linear decision boundaries. As a result, such networks are well suited for differential learning (e.g., see chapter 10). We refer to the differentially-generated variants as Differential Radial Basis Function (DRBF) classifiers.
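The difference between the two node types can be made concrete with a short sketch. The code below is illustrative only: the function names are our own, and the "standard" node is evaluated for the special case of an isotropic covariance $\sigma^2 \mathbf{I}$ so that the two outputs differ only by the normalizing term that the modified node drops.

```python
import math
from typing import Sequence


def squared_distance(x: Sequence[float], mu: Sequence[float]) -> float:
    """(x - mu)^T I (x - mu) for equal-length vectors."""
    if len(x) != len(mu):
        raise ValueError("x and mu must have the same dimensionality")
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu))


def modified_rbf(x: Sequence[float], mu: Sequence[float], variance: float) -> float:
    """Modified RBF node: no normalizing term, one variance parameter."""
    return math.exp(-squared_distance(x, mu) / (2.0 * variance))


def standard_rbf_isotropic(x: Sequence[float], mu: Sequence[float],
                           variance: float) -> float:
    """Standard Gaussian RBF node evaluated with covariance variance * I."""
    m = len(x)
    normalizer = (2.0 * math.pi * variance) ** (-m / 2.0)
    return normalizer * modified_rbf(x, mu, variance)


if __name__ == "__main__":
    center = [1.0, 2.0]
    print(modified_rbf(center, center, 0.5))            # peak value at the center
    print(standard_rbf_isotropic(center, center, 0.5))  # peak depends on variance
```

At the node's center the modified mapping always returns 1, whereas the peak of the normalized Gaussian grows without bound as the variance shrinks.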


Appendix  L 

Anderson  &  Fisher’s  Iris  Data 

L.1 Original Iris Data¹


The following data describing three varieties of Iris (Iris virginica, Iris versicolor, and Iris setosa) was originally collected by E. Anderson [3], and subsequently used by R. A. Fisher in his seminal paper on linear discriminants [34]. The feature vector $\mathbf{X}$ has four elements: $x_1$ denotes sepal length, $x_2$ denotes sepal width, $x_3$ denotes petal length, and $x_4$ denotes petal width. There are three classes: $\omega_1$ denotes Iris setosa, $\omega_2$ denotes Iris versicolor, and $\omega_3$ denotes Iris virginica. The examples are listed below:


Example  Class  x1    x2    x3    x4
=======  =====  ===   ===   ===   ===
0        1      5.1   3.5   1.4   0.2
1        1      4.9   3.0   1.4   0.2
2        1      4.7   3.2   1.3   0.2
3        1      4.6   3.1   1.5   0.2
4        1      5.0   3.6   1.4   0.2
5        1      5.4   3.9   1.7   0.4
6        1      4.6   3.4   1.4   0.3
7        1      5.0   3.4   1.5   0.2
8        1      4.4   2.9   1.4   0.2
9        1      4.9   3.1   1.5   0.1
10       1      5.4   3.7   1.5   0.2
11       1      4.8   3.4   1.6   0.2
12       1      4.8   3.0   1.4   0.1
13       1      4.3   3.0   1.1   0.1
14       1      5.8   4.0   1.2   0.2
15       1      5.7   4.4   1.5   0.4
16       1      5.4   3.9   1.3   0.4
17       1      5.1   3.5   1.4   0.3
18       1      5.7   3.8   1.7   0.3
19       1      5.1   3.8   1.5   0.3
20       1      5.4   3.4   1.7   0.2
21       1      5.1   3.7   1.5   0.4
22       1      4.6   3.6   1.0   0.2
23       1      5.1   3.3   1.7   0.5
24       1      4.8   3.4   1.9   0.2
25       1      5.0   3.0   1.6   0.2
26       1      5.0   3.4   1.6   0.4
27       1      5.2   3.5   1.5   0.2
28       1      5.2   3.4   1.4   0.2

¹ We thank Professor Casimir Kulikowski of Rutgers University for providing us with an electronic version of Anderson/Fisher's original data.
99       2      5.7   2.8   4.1   1.3
100      3      6.3   3.3   6.0   2.5
101      3      5.8   2.7   5.1   1.9
102      3      7.1   3.0   5.9   2.1
103      3      6.3   2.9   5.6   1.8
104      3      6.5   3.0   5.8   2.2
105      3      7.6   3.0   6.6   2.1
106      3      4.9   2.5   4.5   1.7
107      3      7.3   2.9   6.3   1.8
108      3      6.7   2.5   5.8   1.8
109      3      7.2   3.6   6.1   2.5
110      3      6.5   3.2   5.1   2.0
111      3      6.4   2.7   5.3   1.9
112      3      6.8   3.0   5.5   2.1
113      3      5.7   2.5   5.0   2.0
114      3      5.8   2.8   5.1   2.4
115      3      6.4   3.2   5.3   2.3
116      3      6.5   3.0   5.5   1.8
117      3      7.7   3.8   6.7   2.2
118      3      7.7   2.6   6.9   2.3
119      3      6.0   2.2   5.0   1.5
120      3      6.9   3.2   5.7   2.3
121      3      5.6   2.8   4.9   2.0
122      3      7.7   2.8   6.7   2.0
123      3      6.3   2.7   4.9   1.8
124      3      6.7   3.3   5.7   2.1
125      3      7.2   3.2   6.0   1.8
126      3      6.2   2.8   4.8   1.8
127      3      6.1   3.0   4.9   1.8
128      3      6.4   2.8   5.6   2.1
129      3      7.2   3.0   5.8   1.6
130      3      7.4   2.8   6.1   1.9
131      3      7.9   3.8   6.4   2.0
132      3      6.4   2.8   5.6   2.2
133      3      6.3   2.8   5.1   1.5
134      3      6.1   2.6   5.6   1.4
135      3      7.7   3.0   6.1   2.3
136      3      6.3   3.4   5.6   2.4
137      3      6.4   3.1   5.5   1.8
138      3      6.0   3.0   4.8   1.8
139      3      6.9   3.1   5.4   2.1
140      3      6.7   3.1   5.6   2.4
141      3      6.9   3.1   5.1   2.3
142      3      5.8   2.7   5.1   1.9
143      3      6.8   3.2   5.9   2.3
144      3      6.7   3.3   5.7   2.5
145      3      6.7   3.0   5.2   2.3
146      3      6.3   2.5   5.0   1.9
147      3      6.5   3.0   5.2   2.0
148      3      6.2   3.4   5.4   2.3
149      3      5.9   3.0   5.1   1.8

L.2  Normalized  Iris  Data 


The following data was computed via an affine transformation of the original data. We determined the lower and upper bound on each of the four feature vector elements $\{(l_1, h_1), \ldots, (l_4, h_4)\}$; we then transformed each element of the 150 vectors as follows

$$x_i^{j\prime} = \frac{2\,(x_i^{j} - l_i)}{(h_i - l_i)} - 1 \tag{L.1}$$

where $x_i^{j}$ denotes the $i$th element of example $j$ and $x_i^{j\prime}$ denotes the post-transformation value of $x_i^{j}$. This affine transformation normalizes each element of the $N = 4$-dimensional feature vector $\mathbf{X}$ to the closed interval $[-1, 1]$. That is, the affine transformation projects the original feature vector $\mathbf{X} \in \Re^4$ onto the normalized vector $\mathbf{X}' \in [-1, 1]^4$. The normalized data are given below:
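The normalization of (L.1) is simple to reproduce. The sketch below is a minimal illustration with a hypothetical function name; it derives each feature's bounds from the examples it is given, so it reproduces the tabulated values only when the supplied examples include the extremes of the full data set.

```python
from typing import List, Sequence


def normalize_features(examples: Sequence[Sequence[float]]) -> List[List[float]]:
    """Map every feature to [-1, 1] via x' = 2 (x - l) / (h - l) - 1."""
    columns = list(zip(*examples))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    for low, high in zip(lows, highs):
        if high == low:
            raise ValueError("a constant feature cannot be normalized")
    return [
        [2.0 * (x - low) / (high - low) - 1.0
         for x, low, high in zip(row, lows, highs)]
        for row in examples
    ]


if __name__ == "__main__":
    # Sepal lengths of examples 13, 131 and 0 from the original table;
    # 4.3 and 7.9 are the smallest and largest sepal lengths listed.
    sepal_lengths = [[4.3], [7.9], [5.1]]
    for row in normalize_features(sepal_lengths):
        print(round(row[0], 6))
```

The third value agrees with the normalized $x_1$ entry for example 0 in the table that follows (-0.555556).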


Example  Class  x1         x2         x3         x4
0        1      -0.555556   0.250000  -0.864407  -0.916667
1        1      -0.666667  -0.166667  -0.864407  -0.916667
2        1      -0.777778   0.000000  -0.898305  -0.916667
3        1      -0.833333  -0.083333  -0.830508  -0.916667
4        1      -0.611111   0.333333  -0.864407  -0.916667
5        1      -0.388889   0.583333  -0.762712  -0.750000
6        1      -0.833333   0.166667  -0.864407  -0.833333
7        1      -0.611111   0.166667  -0.830508  -0.916667
8        1      -0.944444  -0.250000  -0.864407  -0.916667
9        1      -0.666667  -0.083333  -0.830508  -1.000000
10       1      -0.388889   0.416667  -0.830508  -0.916667
11       1      -0.722222   0.166667  -0.796610  -0.916667
12       1      -0.722222  -0.166667  -0.864407  -1.000000
13       1      -1.000000  -0.166667  -0.966102  -1.000000
14       1      -0.166667   0.666667  -0.932203  -0.916667
15       1      -0.222222   1.000000  -0.830508  -0.750000
16       1      -0.388889   0.583333  -0.898305  -0.750000
17       1      -0.555556   0.250000  -0.864407  -0.833333
18       1      -0.222222   0.500000  -0.762712  -0.833333
19       1      -0.555556   0.500000  -0.830508  -0.833333
20       1      -0.388889   0.166667  -0.762712  -0.916667
21       1      -0.555556   0.416667  -0.830508  -0.750000
22       1      -0.833333   0.333333  -1.000000  -0.916667
23       1      -0.555556   0.083333  -0.762712  -0.666667
24       1      -0.722222   0.166667  -0.694915  -0.916667
25       1      -0.611111  -0.166667  -0.796610  -0.916667
26       1      -0.611111   0.166667  -0.796610  -0.750000
27       1      -0.500000   0.250000  -0.830508  -0.916667
28       1      -0.500000   0.166667  -0.864407  -0.916667
29       1      -0.777778   0.000000  -0.796610  -0.916667
30       1      -0.722222  -0.083333  -0.796610  -0.916667
31       1      -0.388889   0.166667  -0.830508  -0.750000
32       1      -0.500000   0.750000  -0.830508  -1.000000
33       1      -0.333333   0.833333  -0.864407  -0.916667
34       1      -0.666667  -0.083333  -0.830508  -0.916667
35       1      -0.611111   0.000000  -0.932203  -0.916667
36       1      -0.333333   0.250000  -0.898305  -0.916667
37       1      -0.666667   0.333333  -0.864407  -1.000000
38       1      -0.944444  -0.166667  -0.898305  -0.916667
39       1      -0.555556   0.166667  -0.830508  -0.916667
40       1      -0.611111   0.250000  -0.898305  -0.833333
41       1      -0.888889  -0.750000  -0.898305  -0.833333
42       1      -0.944444   0.000000  -0.898305  -0.916667
43       1      -0.611111   0.250000  -0.796610  -0.583333
44       1      -0.555556   0.500000  -0.694915  -0.750000
45       1      -0.722222  -0.166667  -0.864407  -0.833333
46       1      -0.555556   0.500000  -0.796610  -0.916667
47       1      -0.833333   0.000000  -0.864407  -0.916667
48       1      -0.444444   0.416667  -0.830508  -0.916667
49       1      -0.611111   0.083333  -0.864407  -0.916667
50       2       0.500000   0.000000   0.254237   0.083333
51       2       0.166667   0.000000   0.186441   0.166667
52       2       0.444444  -0.083333   0.322034   0.166667
53       2      -0.333333  -0.750000   0.016949   0.000000
54       2       0.222222  -0.333333   0.220339   0.166667
55       2      -0.222222  -0.333333   0.186441   0.000000
56       2       0.111111   0.083333   0.254237   0.250000
57       2      -0.666667  -0.666667  -0.220339  -0.250000
Complexity Reduction Techniques


We describe our implementation of three techniques for reducing the classifier's complexity:

•  Weight  decay. 

•  Weight  smoothing. 

•  Linear  non-invertible  feature  vector  compression 

The number of parameters is the implicit measure of discriminator complexity in all three techniques. The first two techniques work by reducing what Moody calls the effective number of parameters [97] in the discriminator; the third technique reduces the actual number of parameters. By reducing the number of parameters (both actual and effective), we reduce the classifier's discriminant variance.

Although  these  techniques  are  the  only  ones  we  use  in  the  experiments  of  part  II,  many  other  useful  ones 
exist. 


M.1  Weight  Decay 

We employ the weight-decay formalism described by Hanson and Pratt [57], in which parameters (i.e., weights) decay to a value of zero in the absence of learning influences. This form of complexity reduction can be used with any type of parameter vector for which setting an element to zero is equivalent to removing the element from the vector.¹ Assuming a learning procedure that updates each parameter $\theta$ iteratively, the notation $\theta[\eta]$ denotes the parameter value at the $\eta$th update. By this notation, the value of the parameter at the beginning of learning is $\theta[0]$. At the $\eta$th update, $\theta[\eta]$ is equal to a fraction of its value after the previous update plus the parameter change $\Delta\theta[\eta]$ provided by the iterative learning procedure:

¹ Hanson and Pratt's formalism can be generalized to one in which the parameters decay to a potentially non-zero null value in the absence of learning influences. Such a generalization would make the technique applicable to radial basis function (RBF) classifiers, as an example, for which zero-valued parameters are generally not null, but have an effect on the discriminator's mappings.
Figure M.1: Left: The parameters of a 1024-pixel differentially-generated logistic linear classifier described in chapter 9, generated by differential learning without weight smoothing or weight decay. Light parameters, or weights, are positive; dark weights are negative; the gray tone in (all but the vertically centered pixel of) the display's left edge represents the value zero. Right: A histogram of the weights in the left figure. Note the entropy of the weights is 2.8.


$$\theta[\eta] = (1 - \zeta)\,\theta[\eta - 1] + \Delta\theta[\eta] \tag{M.1}$$

The decay rate $\zeta \in [0, 1)$ determines how fast the parameter decays to zero. Equation (M.1) is a first-order
difference equation with the forcing function $\Delta\theta[\eta]$. If we set the initial condition $\theta[-1] = 0$, then
$\theta[0] = \Delta\theta[0]$, and the solution to the difference equation is given by

$$\theta[\eta] = (1 - \zeta)^{\eta}\,\theta[0] + \sum_{i=1}^{\eta} (1 - \zeta)^{\eta - i}\,\Delta\theta[i] \quad \forall\, \eta > 0 \tag{M.2}$$

Thus, the parameter’s dependence on its initial value decreases exponentially as learning progresses (i.e.,
as $\eta$ increases). Likewise, the parameter’s dependence on prior updates decreases exponentially. As
$\zeta \to 1$, decay is very rapid; as $\zeta \to 0$, decay is very slow. After a large number of iterations
(i.e., $\eta \gg 1$) the parameter’s value effectively depends solely on the sequence of past learning updates
($\Delta\theta[\eta], \Delta\theta[\eta - 1], \ldots$). Specifically,

$$\theta[\eta] \approx \sum_{i=1}^{\eta} (1 - \zeta)^{\eta - i}\,\Delta\theta[i], \quad \eta \gg 0,\ \zeta > 0, \tag{M.3}$$


which  means  that  the  parameter  value  can  be  viewed  as  the  output  of  a  first-order  auto-regressive  (AR)  filter 
operating  on  the  learning  procedure's  parameter  update  sequence. 
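
The recursion (M.1) and its solution (M.2) can be checked numerically. The following is a minimal sketch, not the
implementation used in the experiments: the function names and the update sequence are hypothetical stand-ins for
whatever the learning procedure produces, and the first argument plays the role of the updates at steps 1 through
the final step.

```python
def run_weight_decay(updates, zeta, theta0=0.0):
    """Iterate theta[n] = (1 - zeta) * theta[n - 1] + update[n], as in (M.1)."""
    theta = theta0
    trajectory = []
    for delta in updates:
        theta = (1.0 - zeta) * theta + delta
        trajectory.append(theta)
    return trajectory


def closed_form(updates, zeta, theta0=0.0):
    """Evaluate the solution (M.2) after all len(updates) updates."""
    n = len(updates)
    total = (1.0 - zeta) ** n * theta0
    for i, delta in enumerate(updates, start=1):
        total += (1.0 - zeta) ** (n - i) * delta
    return total


# Hypothetical update sequence standing in for the learning procedure's output.
updates = [0.1 * ((-1) ** k) + 0.01 * k for k in range(1, 201)]
zeta = 0.05

theta_final = run_weight_decay(updates, zeta, theta0=3.0)[-1]
theta_closed = closed_form(updates, zeta, theta0=3.0)

# Contribution of the initial value after 200 updates: (1 - zeta)^200 * theta[0].
initial_value_share = (1.0 - zeta) ** len(updates) * 3.0
```

The two results agree to rounding error, and the contribution of the initial value after 200 updates is negligible,
which is the exponential forgetting described above.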


Figure M.2: Left: The parameters of the logistic linear classifier shown in figure M.1, generated by
differential learning with weight decay. Right: A histogram of the weights in the left figure. Note the entropy
of the weights is now 1.6, reflecting the lower variance in their distribution engendered by weight decay. This
lower variance/entropy accounts for the low contrast in the weight display on the left; many of the weights
have decayed to zero.


M.1.1  Parametric  Entropy 

Figure M.1 (left) illustrates the weights of a linear classifier with a single output unit, described in chapter 9.
The classifier is used to diagnose a joint disorder in magnetic resonance images, so its input is retinotopic
(i.e., image-like) and its weights can themselves be displayed as an image. The weights are generated by
differential learning without weight decay or weight smoothing. The lighter weights have positive values; the
darker weights have negative values; the gray tone in (all but the vertically centered pixel of) the display’s
left edge represents the value zero. Figure M.1 (right) shows a histogram of the weight values.

If  we  view  all  the  parameters  as  realizations  of  a  single  random  variable  (rv),  their  histogram  can  be 
loosely  interpreted  as  an  indicator  of  the  rv’s  information  content  —  if  all  the  parameters  have  the  same 
value,  they  don’t  contain  any  information  about  the  patterns  that  the  discriminator  classifies;  if  they  have 
widely  varying  (e.g.,  uniformly-distributed)  values,  they  probably  do  contain  information. 

Definition M.1 Parametric Entropy: The entropy of a parameter vector $\boldsymbol{\theta}$ is based on a 50-bin
histogram of its values, and is simply

$$-\sum_{i=1}^{50} p_i \log_2(p_i),$$

where $p_i$ denotes the fraction of parameter vector elements with values that fall within the $i$th histogram
bin. The histogram spans the range of the largest integer not greater than the most negative parameter to the
smallest integer not less than the most positive parameter:
$\lfloor \mathrm{bin}[1] \rfloor = \lfloor \min_k \theta_k \rfloor$ and $\lceil \mathrm{bin}[50] \rceil = \lceil \max_k \theta_k \rceil$.

Remark: We stress that this definition is an intuitive and rather arbitrary one, neither rigorously
substantiated  nor  generally  applicable.  It  is  restricted  to  parameter  vectors  associated  with  retinotopic 
feature  vectors  (e.g.,  images,  speech  spectrograms,  etc.)  because  it  assumes  all  parameters  are  realizations 
of  the  same  single  random  parameter  variable.  This  assumption  is  plausible  for  image-like  feature  vectors 
because  they  tend  to  have  a  large  number  of  spatially  correlated  elements.  As  a  result,  discriminator 
parameter  vectors  associated  with  the  feature  vector  tend  to  have  an  equally  large  number  of  spatially 
correlated  elements.  Where  the  correlation  is  low  between  parameters,  there  tend  to  be  details  in  the  feature 
vector  that  are  critical  to  the  classification  process.  Of  course,  uncorrelated  parameters  tend  to  have  higher 
variance,  so  their  associated  parametric  entropy  is  higher.  Thus,  parametric  entropy  is  a  convenient  albeit 
ad-hoc  measure  of  the  parameter  vector's  information  content. 
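
Definition M.1 can be sketched in a few lines. This is an illustrative reading of the definition, not the code used
to produce the reported entropies; the function name, the equal-width binning, and the handling of the degenerate
case in which the floor and ceiling coincide are assumptions.

```python
import math


def parametric_entropy(params, bins=50):
    """Entropy (in bits) of a fixed-bin histogram of the parameter values."""
    low = math.floor(min(params))    # largest integer not greater than the minimum
    high = math.ceil(max(params))    # smallest integer not less than the maximum
    if high == low:                  # every parameter equals the same integer
        return 0.0
    width = (high - low) / bins      # assumption: bins of equal width
    counts = [0] * bins
    for value in params:
        # Clamp so that a value equal to `high` lands in the last bin.
        index = min(int((value - low) / width), bins - 1)
        counts[index] += 1
    total = len(params)
    # Empty bins are skipped, following the convention 0 * log2(0) = 0.
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


identical = [0.5] * 1024                       # all values fall in one bin
spread = [(k + 0.5) / 50 for k in range(50)]   # exactly one value per bin
```

With 50 bins the measure ranges from zero (all parameters in one bin, as for `identical`) to $\log_2 50 \approx 5.64$
(parameters spread evenly over all bins, as for `spread`).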

The parametric entropy of the weight vector in figure M.1, generated without weight decay or weight
smoothing, is 2.8. The parametric entropy of the weight vector in figure M.2, generated with a weight decay
rate of $\zeta = 0.005$, is 1.6. The lower parametric entropy is evident when one compares the two weight
displays  and  their  associated  histograms:  the  decayed  weight  distribution  has  less  variance  and  more  kurtosis 
(i.e.,  the  histogram  peaks  more  sharply  about  zero)  —  both  related  to  lower  parametric  entropy.  There  is 
visibly  less  structure  in  the  decayed  weights. 


M.2  Weight  Smoothing 

We employ a simple form of weight smoothing developed by Pomerleau,^2 in which the parameter vector
is filtered after each update. This form of complexity reduction is restricted to weight vectors associated
with retinotopic feature vectors because it relies on the assumption that “neighboring” parameters (i.e.,
those corresponding to neighboring pixels in the feature vector) can be highly correlated without increasing
discriminant error.

We arrange the parameter vector in a manner that reflects the feature vector pixel map with which it
is associated (the weight displays of figures M.1 and M.2 exemplify such an arrangement). We convolve
a simple moving average (MA) filter with this parameter map. The filter perturbs each parameter towards
the average value of it and its eight neighboring parameters in the map. Figure M.3 illustrates the filter’s
kernel function. The gray box at the center represents $\theta_0$, the parameter being processed at a given point in
the filtering convolution; the surrounding boxes represent this parameter’s neighbors $\{\theta_1, \ldots, \theta_8\}$ in the
parameter map. The kernel omits the appropriate parameters (with a commensurate modification to (M.5)
below) when they do not exist (e.g., for $\theta_0$ parameters on the edge of the map). We denote the output of the
filter by $\tilde{\theta}_0$ when the kernel is centered on $\theta_0$:


^2 Personal communication.

Figure M.3: The moving average filter kernel used for weight smoothing. The gray parameter $\theta_0$ is the
principal input to the filter; its neighboring parameters comprise the other terms in the moving average of
(M.5).


(M.5) 


The parameter $\kappa \in (0, 1)$ adjusts the level of smoothing. The filter is convolved with the parameter map
after each update of the parameter vector.

Figure M.4 (left) illustrates weights generated by differential learning with weight smoothing ($\kappa = 0.05$).
The  parametric  entropy  of  the  weight  vector  in  figure  M.l,  generated  without  weight  decay  or  weight 
smoothing,  is  2.8.  The  parametric  entropy  of  the  weight  vector  in  figure  M.4  is  2.2.  The  lower  parametric 
entropy  is  evident  when  one  compares  the  two  weight  displays  and  their  associated  histograms:  the  smoothed 
weight  distribution  has  less  variance  and  more  kurtosis.  There  is  visibly  less  structure  in  the  smoothed 
weights,  and  they  appear  blurred  as  a  result  of  the  iterative  filtering  operation. 
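
The smoothing step can be sketched as follows. This is an illustration under a stated assumption, not a
transcription of (M.5): we assume the filter output is the blend (1 - kappa) * center + kappa * (mean of the center
and its existing neighbors), which is one way to perturb a parameter "towards the average value of it and its eight
neighboring parameters". The function name and the list-of-rows representation of the parameter map are likewise
hypothetical.

```python
def smooth_weights(weight_map, kappa):
    """Return a smoothed copy of a 2-D parameter map (a list of equal-length rows)."""
    rows, cols = len(weight_map), len(weight_map[0])
    smoothed = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Center parameter plus whichever of its eight neighbors exist;
            # neighbors beyond the edge of the map are simply omitted.
            neighborhood = [
                weight_map[rr][cc]
                for rr in range(max(r - 1, 0), min(r + 2, rows))
                for cc in range(max(c - 1, 0), min(c + 2, cols))
            ]
            local_mean = sum(neighborhood) / len(neighborhood)
            # Assumed blend: kappa controls how far the parameter moves toward the mean.
            smoothed[r][c] = (1.0 - kappa) * weight_map[r][c] + kappa * local_mean
    return smoothed


# A single large weight in the middle of a 3 x 3 map spreads to its neighbors.
spike = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
blurred = smooth_weights(spike, kappa=0.5)
```

In use, such a function would be called once after each update of the parameter vector, so its blurring effect
accumulates over the course of learning.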


M.3 Linear Non-Invertible Feature Vector Compression

We employ a simple form of lossy (i.e., non-invertible) data compression as a third approach to complexity
reduction. Like weight smoothing, it is restricted to weight vectors associated with retinotopic feature vectors
because it relies on the assumption that “neighboring” elements (i.e., those corresponding to neighboring
pixels in the feature vector map) are highly correlated and, as a result, redundant.

Figure M.4: Left: The parameters of the logistic linear classifier shown in figure M.1, generated by differential
learning with weight smoothing. Right: A histogram of the weights in the left figure. Note the entropy of the
weights is 2.2, compared with 2.8 for the weights generated without smoothing. The lower entropy reflects
the lower variance in the distribution of weights caused by weight smoothing, which accounts for the lower
contrast in this weight display compared to the one in figure M.2. Note the blurred appearance of the weights
due to the filtering effect of weight smoothing.

The compression ratio (CR) is the ratio of the number of elements in the original feature vector to the
number of elements in the compressed feature vector. Figure M.5 illustrates compression when the ratio CR
is an integer (top), and when it is not an integer (bottom). For the case in which the compression ratio
is an integer, the compressed element of X, which we denote by $x'$, is equal to the average value of the
elements in X from which it is formed; we denote these elements by $\{x_1, \ldots, x_{CR}\}$:


$$x' = \frac{1}{CR} \sum_{i=1}^{CR} x_i \tag{M.6}$$


In  figure  M.5  (top)  CR  =  4 ,  and  each  x'  in  the  compressed  image  is  the  average  of  four  pixels  in  the 
original  image  (i.e.,  the  average  of  four  elements  in  the  original  feature  vector). 
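
For an integer compression ratio, the averaging can be sketched as below. The sketch assumes that the constituent
elements of each compressed element form a square tile of the pixel map (so CR = 4 corresponds to 2 x 2 tiles); the
function name and the list-of-rows image representation are hypothetical.

```python
def compress_integer(image, block):
    """Average non-overlapping block x block tiles; the compression ratio is block ** 2."""
    rows, cols = len(image), len(image[0])
    if rows % block or cols % block:
        raise ValueError("image dimensions must be multiples of the block size")
    compressed = []
    for r in range(0, rows, block):
        out_row = []
        for c in range(0, cols, block):
            # The CR = block ** 2 original elements that form one compressed element.
            tile = [image[r + dr][c + dc] for dr in range(block) for dc in range(block)]
            out_row.append(sum(tile) / len(tile))
        compressed.append(out_row)
    return compressed


image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
compressed = compress_integer(image, block=2)   # CR = 4: 16 elements become 4
```

Each compressed element is the mean of its four constituent pixels; for example, the upper left element is
(1 + 2 + 5 + 6) / 4 = 3.5.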

For the case in which the compression ratio is not an integer, $x'$ is proportional to the value of the
elements in X from which it is formed. Figure M.5 (bottom) illustrates compression when CR = 2.25. In
this case, every element of the compressed feature vector is formed from four elements in the original vector.
As an example, the lower right element of the compressed vector, which we will denote by $x'$, is given by


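$$x'\,\big|_{CR = 2.25} \;\propto\; \sum_{i=1}^{4} a_i\,x_i \tag{M.7}$$

where $a_i$ is the fraction of the area of $x_i$ that falls within the bounds of the compressed element.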


The subscripts of $\{x_1, \ldots, x_4\}$ in (M.7) are shown in the left side of the figure, which depicts the original
feature vector elements.
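
The non-integer case can be sketched by weighting each original pixel by the area it shares with the compressed
pixel. This is an illustration, not the authors' implementation: the function names are hypothetical, the same
scale factor (the square root of CR) is assumed along both axes, and the weighted sum is normalized by CR so that a
constant image is left unchanged, which is an assumption about the constant of proportionality.

```python
def overlap(low, high, k):
    """Length of the intersection of [low, high] with the unit interval [k, k + 1]."""
    return max(0.0, min(high, k + 1) - max(low, k))


def compress_area_weighted(image, scale):
    """Compress by `scale` along each axis, so the compression ratio is scale ** 2."""
    rows, cols = len(image), len(image[0])
    out_rows, out_cols = round(rows / scale), round(cols / scale)
    compressed = []
    for i in range(out_rows):
        out_row = []
        for j in range(out_cols):
            total = 0.0
            for r in range(rows):
                for c in range(cols):
                    # Fraction of original pixel (r, c) inside compressed pixel (i, j).
                    area = (overlap(i * scale, (i + 1) * scale, r)
                            * overlap(j * scale, (j + 1) * scale, c))
                    total += area * image[r][c]
            out_row.append(total / (scale * scale))   # assumed normalization by CR
        compressed.append(out_row)
    return compressed


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
compressed = compress_area_weighted(image, scale=1.5)   # CR = 2.25: 9 elements become 4
```

For the lower right compressed element, the four contributing pixels have area fractions 1/4, 1/2, 1/2 and 1,
giving (0.25 * 5 + 0.5 * 6 + 0.5 * 8 + 1 * 9) / 2.25.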

Figures 8.1 and 8.5 in chapter 8 and figures 9.1 and 9.3 in chapter 9 illustrate the effects of linear
non-invertible compression for two different retinotopic feature vectors. The compression ratio in both tasks
is 4:1. We characterize the compression as non-invertible because the original feature vector cannot be
derived from the compressed vector.

Figure M.5: Top: Linear non-invertible compression with a compression ratio of 4:1. The value of each pixel
in the compressed image is equal to the average value of the four constituent pixels in the original image.
Bottom: Linear non-invertible compression with a compression ratio of 2.25:1. The constituent pixels in the
original image contribute to the compressed image in proportion to the fraction of their area that falls within
the bounds of the compressed pixel (see equation (M.7)).


M.3.1  A  Brief  Argument  Against  Principal  Components  Analysis 

Readers  familiar  with  the  method  of  principal  components  analysis  (PCA)  might  wonder  why  we  do  not 
employ  this  technique.  In  its  details  the  reason  is  rather  long-winded,  so  we  give  only  a  brief  explanation. 
Principal  components  analysis  relies  on  the  following  assumptions: 

• A feature vector X’s first and second moments are assumed to be sufficient statistics for the pattern
recognition task, to the extent that the following assumption holds:

• The feature vector’s covariance matrix can be expressed in terms of its eigenvectors and eigenvalues.
The eigenvectors associated with the largest eigenvalues (i.e., X’s principal components) contain the
bulk of X’s variance, and as a result, they are assumed to contain the bulk of the information necessary
for separating the class-conditional densities of X in feature vector space $\mathcal{X}$.


Fukunaga has written extensively on this topic; we refer the reader to [40, chs. 9-10]. The second
assumption above is one that is frequently violated, as eloquently and succinctly described in [40, pp.
442-443]. In short, under certain circumstances the minor components of X (i.e., the eigenvectors associated
with the smallest eigenvalues) will contain the bulk of the information necessary for separating the
class-conditional densities of X; principal components analysis would discard these very components.
Under similar circumstances, the information necessary for robust discrimination of X is distributed across
all its elements, so eliminating any of them would result in higher discriminant bias. As a result, we eschew
all but the crudest and most general form of feature vector dimensionality reduction: the linear non-invertible
compression described above. The circumstances under which it is applicable are obvious (the feature vector
must be retinotopic in nature), so there is little danger of applying the technique when it is inappropriate. The
only danger is using a compression ratio so high that information essential to robust discrimination is lost
(see, for example, chapter 9).
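
A small constructed example makes the failure mode concrete. The data below are hypothetical and chosen only to
illustrate the circumstance described above: two classes share a large variance along one axis and differ only by a
small offset along the other, so the principal component carries almost all of the variance and none of the class
information. The helper names are our own.

```python
import math


def principal_axes(points):
    """Return (major, minor) unit eigenvectors of the 2-D sample covariance matrix."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / n
    syy = sum((p[1] - mean_y) ** 2 for p in points) / n
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / n
    # Orientation of the largest-variance axis of a symmetric 2 x 2 matrix.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    major = (math.cos(angle), math.sin(angle))
    minor = (-math.sin(angle), math.cos(angle))
    return major, minor


def project(points, axis):
    """Project each 2-D point onto a unit axis."""
    return [p[0] * axis[0] + p[1] * axis[1] for p in points]


def mean(values):
    return sum(values) / len(values)


# Hypothetical data: large shared spread along x, small class offset along y.
xs = [float(x) for x in range(-5, 6)]
class_a = [(x, 0.1) for x in xs]
class_b = [(x, -0.1) for x in xs]

major, minor = principal_axes(class_a + class_b)
gap_major = abs(mean(project(class_a, major)) - mean(project(class_b, major)))
gap_minor = abs(mean(project(class_a, minor)) - mean(project(class_b, minor)))
```

Here the class means coincide when projected onto the principal component, while the minor component separates the
two classes completely; retaining only the principal component would discard exactly the information needed for
discrimination.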
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